Chapter I
INTRODUCTION 


1.1  A  Brief  Survey  of  Multiprocessor  Systems 

The advent of large-scale integrated (LSI) circuits has had a tremendous
impact on computer system design. In particular, LSI technology has made
great strides in the areas of memory packaging and microprocessor design.
The existence of inexpensive but powerful microprocessors and extremely
high-density semiconductor memory chips has led people to consider the
design of large computer systems incorporating a large number of processors
and memory modules.

In fact, the idea of using multiple processing units to handle various
functions of the whole system is not new. People started thinking about and
building machines with multiple processing elements (PEs) at least twenty
years ago. Back in 1958, Unger [1] designed a machine to perform
pattern-recognition processing, which consisted of a central control
computer and a processing element array. During the same year, three other
systems were designed and manufactured, namely, the National Bureau of
Standards' PILOT system [2] and the USAF's AN/FSQ-31 and 32 air defense
systems. Although these old machines do not quite fit the definition of a
multiprocessor system commonly accepted today (some of them we would prefer
to call multiple-computer systems), they do show the approach people used to
speed up their systems, i.e., using several processing units to carry out
several operations at the same time.

Before we go on, we would like to present a definition of a multiprocessor
in order to clarify some ambiguities. According to the American National
Standard Vocabulary for Information Processing [3], a multiprocessor system
is defined as "a computer employing two or more processing units under
integrated control." A better definition was proposed by Enslow [4]. He
defines a multiprocessor to be a system with:

•  Two  or  more  processing  units 

•  Shared  common  memory 

•  Shared  I/O  channels,  control  units,  and  devices 

•  Single  integrated  operating  system 

•  Hardware  and  software  interaction  at  all  levels 

We will use this definition throughout this report. Obviously, a group of
computers connected by some communication means, such as ARPANET, does not
have all five of these characteristics; it does not qualify as, and will not
be called, a "multiprocessor" system.

So, the first "true" multiprocessor under this definition should be
Burroughs' D-825 system [4,5,6,7], announced in 1960. Many multiprocessor
machines have been designed and built since then, for example, the Burroughs
B-5000, IBM 704X/709X, CDC 6600, and Univac 1108/1110. A complete list can
be found in [5], and a very good bibliography in [8].

Most of the multiprocessor systems have only a small number of processors,
say 2 to 10. This is not surprising because they were built before LSI
became popular, when the hardware was still very expensive. Only a few
machines were designed to have a large number of processing units. The most
famous ones are, of course, ILLIAC IV and its two predecessors, SOLOMON I
and SOLOMON II, which were designed by Slotnick et al. [9,10] to work on
problems involving differential equations, linear algebra, and weather data
processing. Since all these problems contain a lot of matrix operations,
sometimes on matrices of thousands by thousands, they do need a machine with
a large array of PEs in order to get a reasonably fast response time.

In the past few years, the tremendous improvement in circuit performance and
the drastic reduction in hardware price have made multiprocessor design even
more attractive. In particular, the advent of the LSI microprocessor has
brought the system designer into a new world. People have started building
systems using tens, hundreds, or even thousands of cheap but very powerful
microprocessors. Recently, several projects have been proposed, e.g. [11],
to construct systems with 1024 or more processing elements. Only a very low
cost, say a few hundred dollars per PE, can make this kind of design
possible. This was still a dream even five years ago.

Of course, a lot of questions arise in this kind of new design. For example,
how do we interconnect so many processors? How do these processors share
resources and communicate with each other? How do we control the operation
of the whole system and fully utilize the hardware? Needless to say, all
these questions need to be answered satisfactorily before we can come up
with a good design. People are getting more and more concerned with these
problems. It is our intention to make a thorough study of them in order to
get a better understanding of how to design such a multiprocessor system.

Before  we  try  to  answer  those  questions,  we  would  like  to  briefly 
discuss  three  well-known  systems  to  give  readers  an  idea  of  what  kind  of 
system  we  are  dealing  with.  They  are:  PRIME  system  at  the  University  of 
California  at  Berkeley  [12],  C.mmp  system  at  Carnegie-Mellon  University  [13], 
and  Tandem  16  NonStop  system  by  Tandem  Computers,  Inc.  [14]. 

All these systems are made up of a certain number of microprocessors and
memory modules. However, due to different sets of design objectives, e.g.,
degree of resource sharing, expandability, etc., each system has a
completely different architecture and operating system design philosophy.
For example, they use three fundamentally different interconnection schemes
to connect their processors and memories, namely:

•  Multiport  memories 

•  Crossbar  switch 

•  Time-shared  common  bus 

We  would  like  to  list  their  differences,  and  try  to  compare  their  advantages 
and  disadvantages.  Hopefully,  we  can  learn  something  from  this  study  which 
can  be  used  as  a  valuable  design  guide  in  the  future. 

1.2  Comparison  of  Three  Multiprocessor  Systems 
1.2.1  PRIME  System 

The PRIME system is a medium-size, general-purpose time-sharing system whose
design is aimed at improving the cost/performance ratio, reliability, and
privacy of current time-sharing systems. Figure 1 shows the system
architecture of the PRIME system. It consists of five Meta 4 microprocessors
(by Digital Scientific Corporation) and thirteen four-port memory modules.
Every processor is connected to eight memory modules via a dedicated
processor bus, so it is a multiport memory connection. (Since each processor
only connects to about two-thirds of the total memory, we will call this a
"partial" connection in later discussions.)

Figure 1. Structure of the PRIME System.

Meta 4 is a microprogrammable microprocessor that has a processor cycle time
of 100 ns. It operates on 16-bit operands and has a 32-bit microstore. The
MOS memory has 16 bits per word, a 400 ns access time, and a 600 ns cycle
time. Each four-port memory module has 8K words made up of two 4K-word
submodules. There is a four-by-two switch inside each module which can
connect a memory port to any submodule. This memory organization allows
two-way interleaving inside a module.
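
As a rough illustration of two-way interleaving, the sketch below maps a
word address inside one 8K-word module onto its two 4K-word submodules. The
low-order mapping and the function name locate are assumptions made for
illustration only; the text does not say which address bit PRIME uses to
select a submodule.

```python
WORDS_PER_MODULE = 8 * 1024   # 8K words per four-port module (from the text)
SUBMODULES = 2                # two 4K-word submodules per module


def locate(word_address):
    """Map an address inside one module to (submodule, offset).

    Assumes low-order interleaving: consecutive addresses alternate
    between the two submodules. This mapping is hypothetical.
    """
    if not 0 <= word_address < WORDS_PER_MODULE:
        raise ValueError("address outside the 8K-word module")
    return word_address % SUBMODULES, word_address // SUBMODULES


# Consecutive words land in different submodules, so two ports can be
# served during the same memory cycle.
for addr in range(4):
    print(addr, locate(addr))
```

Under this assumed mapping, two processors stepping through adjacent words
would tend to hit different submodules, which is what lets the four-by-two
switch serve two ports at once.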

All the peripheral devices (except user terminals) are connected to the
processors and the memories via a big interconnection network called the
External Access Network (EAN), which is essentially a crossbar network [15].
The network is controlled by five I/O control boxes. These I/O control logic
units not only control the information flow in and out of the peripheral
devices but also control some inter-processor communication.

At any given time, the whole system is partitioned into five physically
separated subsystems [16]. No two subsystems will share the same memory
module or disk space. This is to achieve high privacy, which is very
important in a multiprocessor system. One of these five subsystems is
assigned to be the "control subsystem," and the rest are "program
subsystems." All the users will compete for these four program subsystems.

The operating system [17] is also partitioned into two parts, namely, the
Central Control Monitor (CCM) and the External Control Monitor (ECM), as
shown in Figure 2. The control subsystem is assigned the CCM, and each
program subsystem has a copy of the ECM. The CCM is the centralized part of
the system-wide operating system, which controls all the system tasks like
job scheduling, resource allocation, interrupt handling, and inter-processor
communication. The CCM also monitors the I/O control boxes to determine the
connections made in the interconnection network. Whenever a program
processor wants to talk to another program processor or access a peripheral
device, it must send a request to the CCM and obtain its permission. Then,
the CCM will make the connection by telling the interconnection network to
do so.

Figure 2. Structure of the PRIME Operating System.

ECM, on the other hand, is the local representative of the operating system
in each subsystem, including the control subsystem. It performs the local
management functions related to processes running on that subsystem, e.g.,
teletype I/O for the teletypes physically connected to the subsystem's
processor, swapping out the current process and swapping in the next
process, etc. It also controls the communication between user processes and
the CCM, and does the independent verification of CCM decisions. In fact,
all five ECMs work like a communication subnet.

Each subsystem also has a Local Monitor (LM). Every user can define his own
LM to control all the intra-subsystem tasks, e.g., the management of
resources allocated to him, the generation of interrupts, etc. So, the
software is modularized and partially distributed into all the subsystems.
This is a very important factor for achieving high availability.

For reliability reasons, any subsystem can become the control subsystem.
Whenever there is a failure in the current control subsystem (which may be
detected by another subsystem's ECM), any other subsystem can take over the
job immediately. If one program subsystem goes down, the whole system only
suffers a performance degradation of 25%.

Hence, the PRIME system is a highly reliable, highly secure, and highly
available system. Besides, due to its multiport memory connection, it is
very easy to expand and reconfigure.

Of course, the physical boundary between two subsystems essentially
eliminates the possibility of code sharing by two processes. Thus, some
software duplication is needed, which effectively reduces the available
memory in each subsystem. However, the designers do not consider this a
drawback. A paper by Ravi [18] points out that code sharing will actually
generate more cons than pros, e.g., the system will need higher memory
bandwidth due to the higher memory interference caused by competing
processes.

There are two very interesting things we would like to point out. First,
each program subsystem, and hence each processor, is dedicated to one user
job until this job is swapped back onto the disk. A user job will not be
brought into the main memory unless there is a free processor and the
available memory space attached to it is large enough. So, at any given
time, at most four user programs reside in the main memory being executed.
In other words, the PRIME system does not allow more than one job to be
executed by a processor subsystem at the same time. It would seem that the
processors are not fully utilized. However, there is no overhead due to
changing of jobs, e.g., swapping out the status information of the current
job and reconfiguring a new subsystem. Furthermore, the operating system is
much simpler, which increases the software reliability.

The second thing is the partial connection scheme. The physical partition
will sometimes eliminate the chance for a new job to enter the system, even
though the total free memory space is large enough and there is some
processor available. Of course, there is some amount of performance
degradation due to this fact. We are interested in finding out how bad this
is, compared to a more expensive full connection scheme like a crossbar
switch.
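
The blocking effect of a partial connection can be sketched with a toy
admission check. Everything in this example (the function can_admit, the
two-processor, four-module configuration, and the sizes) is hypothetical
and chosen only to show the effect; it is not PRIME's actual allocation
algorithm.

```python
def can_admit(job_words, free_processors, reachable, free_words):
    """Return the first free processor whose reachable modules hold
    enough free memory for the job, or None if no processor qualifies."""
    for p in free_processors:
        available = sum(free_words[m] for m in reachable[p])
        if available >= job_words:
            return p
    return None


# Hypothetical toy configuration: 2 processors, 4 modules of 8K words.
reachable = {"P0": ["M0", "M1", "M2"], "P1": ["M1", "M2", "M3"]}
free_words = {"M0": 8192, "M1": 0, "M2": 0, "M3": 8192}

total_free = sum(free_words.values())   # 16384 words are free in total

# Partial connection: neither processor can reach both free modules.
partial = can_admit(12288, ["P0", "P1"], reachable, free_words)

# Full connection: every processor reaches every module.
full_map = {p: list(free_words) for p in reachable}
full = can_admit(12288, ["P0", "P1"], full_map, free_words)

print(total_free, partial, full)
```

Here a 12K-word job is rejected under the partial connection although 16K
words are free in total, while the same job is admitted when every processor
can reach every module.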

Needless to say, the PRIME system does provide us with a lot of interesting
subjects to study. We will discuss them in more detail later.

1.2.2 C.mmp System

C.mmp  is  the  multi-mini-processor  system  at  the  computer  science 
department  of  Carnegie-Mellon  University  [13,19].  The  overall  architecture 
is  shown  in  Figure  3.  It  consists  of  sixteen  PDP  11  minicomputers  connected 
through  a  crossbar  switch  to  sixteen  memory  modules.  Every  processor  (Pc, 
a  modified  PDP  11  processor)  can  access  any  memory  module  via  the  crossbar 
switch.  So,  the  memory  is  completely  shared  by  all  processors.  This  is 
one  basic  difference  between  PRIME  and  C.mmp.  We  will  call  this  kind  of 
connection  a  full  connection. 

C.mmp is designed for solving large artificial intelligence problems. This
kind of problem needs a number of processors to work on a large common data
base simultaneously in order to obtain the answer in real time. So, complete
memory sharing is crucial, and that is why it uses a 16 x 16 crossbar switch
for the memory-processor connection. Of course, there will be some memory
contention due to memory sharing. In the next chapter, we will give an
analytical solution for this problem. However, the memory contention can be
reduced by using local (or private) memory. In C.mmp, each processor has a
4K local memory which is not shared by other processors (Figure 3).

Each processor can be a slightly modified version of any model in the PDP 11
family. In the first stage of implementation, five PDP 11/20's were
installed. Another four PDP 11/40's were scheduled to be added in the summer
of 1975. The PDP 11/40 operates on 16-bit operands and has a processor cycle
time of 650 ns. Notice that, although the processor cycle time of the
PDP 11/40 is much longer than that of Meta 4, it does not mean the PDP 11/40
has a lower instruction execution rate. The instruction execution rate is
determined by the number of cycles each instruction will take. For the
PDP 11/40, most of the instructions only take one or two processor cycles.
It averages roughly 0.44 million instructions per second. On the other hand,
Meta 4 has a similar rate since each instruction needs several
microinstructions to execute.

Both the PDP 11/20 and the PDP 11/40 use core memory which has a 500 ns
access time and a 1.2 µs cycle time. Although the memory is not very fast,
we can interleave a program across all 16 modules to get a high memory
bandwidth.
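
To give a feel for what interleaving buys, the sketch below computes an
upper bound on memory bandwidth, taking the core cycle time to be 1.2
microseconds and assuming that all 16 modules can be kept busy at once. The
function name and the no-conflict assumption are ours; real programs will
fall short of this bound because of module conflicts.

```python
CYCLE_TIME_S = 1.2e-6   # core memory cycle time, taken as 1.2 microseconds
MODULES = 16            # number of memory modules in C.mmp


def peak_words_per_second(modules, cycle_time_s):
    """Upper bound: every module delivers one word per memory cycle."""
    return modules / cycle_time_s


single = peak_words_per_second(1, CYCLE_TIME_S)
interleaved = peak_words_per_second(MODULES, CYCLE_TIME_S)
print(f"{single / 1e6:.2f} M words/s, {interleaved / 1e6:.2f} M words/s")
```

Under these assumptions a single module peaks at roughly 0.83 million words
per second, and perfect 16-way interleaving raises the bound sixteenfold.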

One other area where C.mmp differs from PRIME is that peripheral devices are
not shared. Each peripheral device is connected to the Unibus of a processor
and can only be used by this processor. Hence, the processors must use the
primary memory for interchanging information. Both this and memory sharing
are possible sources of privacy violations, so software protection is a very
important issue in the operating system design.

Although the main purpose of C.mmp is to use the system as a whole to work
on a large program, it can also be partitioned, either dynamically or
statically (manually), into several independent subsystems and operated in a
fashion like PRIME. Due to the partitionability of the crossbar switch, the
hardware can be partitioned into two, three, or even 16 totally separated
subsystems. This greatly eases maintenance, since if any processor or memory
module is down, it can be isolated from the rest of the system and turned
over to the hardware engineer for replacement. Thus, it would not require
taking the entire machine down for maintenance.

Unlike the PRIME system, C.mmp does not designate a single processor as the
control processor. This is because C.mmp is designed to have up to 16
processors, and when the number of processors increases, the master (or
control) processor quickly becomes a bottleneck. This is the reason why
PRIME can only have a small number of processors (5), since Meta 4 is a
relatively slow minicomputer. However, this means that each processor in
C.mmp should have its own copy of the operating system if it is working
alone. In order not to occupy too much memory, the size of the operating
system should be minimized while still meeting all the users' requirements.
Certainly, this needs a special kind of operating system design.

HYDRA,  the  operating  system  for  C.mmp,  is  designed  for  this 
purpose  [20].  The  central  core  of  HYDRA  is  a  "kernel"  set  of  operating 
system  facilities  which  provide  both  basic  protection  and  management  of  the 
hardware  resource.  However,  the  kernel  does  not  provide  software  for  things 
like  the  file  system,  job  control  language,  or  scheduling  policy.  These 
are  supplied  by  the  user. 

This approach has several advantages. First, the user has the freedom to
define his own operating system, for example, a job control language. This
not only allows the user to minimize the size of his operating system, but
also allows him to specify some facility not provided by the existing
programs or to replace an existing facility by one more closely attuned to
his own needs. Second, an error in one user's operating system can only
affect his own program. It will not crash the entire system. This greatly
increases the reliability of the software. Since the kernel is rather small
and well-defined, an error in it is very rare.

In the C.mmp system, a program usually can be run on any available
processor. This is why it needs a crossbar switch in order to provide a full
connection. Of course, the processor utilization will be higher than that on
PRIME. However, we can see there must be a big overhead associated with job
swapping, especially when each user has defined his own operating system. We
will show later that this scheme might not be a good idea in some cases.

The use of a crossbar switch, of course, has some disadvantages: first, it
is very expensive; second, it is not easy to expand. C.mmp can have a
maximum of sixteen processors and sixteen memory modules. If we use
PDP 11/40s, the system can yield up to 7 million instructions per second
(mips), which is comparable to an IBM 370/158, but it will be very difficult
and expensive to expand the system beyond sixteen processors. This scheme
certainly will not work in future systems where we might have thousands of
processors.
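
The two figures behind this argument can be checked with a short
calculation. The sketch assumes an ideal aggregate rate (no memory
contention or operating system overhead) and counts one switch point per
processor-memory pair as a simple proxy for crossbar cost; both function
names are ours.

```python
MIPS_PER_PDP_11_40 = 0.44   # average rate quoted for the PDP 11/40


def aggregate_mips(processors, mips_each=MIPS_PER_PDP_11_40):
    """Ideal aggregate rate, ignoring contention and OS overhead."""
    return processors * mips_each


def crosspoints(processors, memory_modules):
    """A full crossbar needs one switch point per processor-memory pair."""
    return processors * memory_modules


print(aggregate_mips(16))         # about 7 mips for sixteen PDP 11/40s
print(crosspoints(16, 16))        # switch points in the 16 x 16 crossbar
print(crosspoints(1024, 1024))    # the same count for 1024 x 1024
```

Sixteen processors at 0.44 mips give about 7 mips, matching the figure in
the text, while the switch-point count grows with the product of the two
dimensions: 256 for 16 x 16, but over a million for 1024 x 1024.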

But, in general, C.mmp is a very reliable system in both software and
hardware, highly available, and easy to maintain. In particular, the ideas
of HYDRA will be very helpful in operating system design for future
multiprocessor systems.

1.2.3  NonStop  System 

Figure 4 shows the architecture of a recently announced multiprocessor,
Tandem Computers' NonStop System [14]. This system, configured with up to 16
minicomputers, is designed to handle heavy banking transaction processing
and to provide very high availability. The basic difference from the two
previous systems is that the processors are connected together by a pair of
time-shared common buses. Processors communicate with each other via these
buses.

The use of a common bus offers the advantages of very low cost and ease of
modifying the hardware configuration. For example, we can add or remove a
functional unit fairly easily. However, the overall system performance is
limited by the bus transfer rate, and a failure of the bus would be
catastrophic. Hence, NonStop uses dual redundant buses to increase the
transfer rate and the availability.

Figure 4. Tandem 16 NonStop System. (Diagram labels: dual redundant
communication buses, processor module, disk controller, tape controller,
communication controller.)

Each processor module is actually a complete minicomputer, having its own
control unit, arithmetic unit, private memory, and its own copy of the
operating system. So, every processor has the ability to keep working even
if all other processors are down. Also, whenever a processor goes down,
other processors can take over without much difficulty.

There is no memory sharing; instead, the processors share the peripheral
devices (e.g., disk and tape) via the controllers. This is because the
system is designed for handling banking transactions and all the processors
are supposed to work on a big data base on disks or tapes. Therefore, unless
each minicomputer can provide a large amount of primary memory, the use of
this kind of architecture is perhaps only appropriate for data base
management.

The number of processor modules that can be attached to the common data bus
seems to be unlimited. However, as the number of processors increases, the
bus contention increases drastically. This will seriously degrade the system
throughput. Besides, the longer the bus is, the larger the time skew is, and
the slower the clock rate will be. So, the expandability is limited by a
number of constraints. When the workload grows beyond a certain limit, this
architecture will no longer be able to expand and perform satisfactorily.
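
The saturation effect can be illustrated with a deliberately crude model of
a single shared bus. It assumes each processor needs the bus for a fixed
fraction of its running time (here 0.125, an arbitrary value) and ignores
queueing delay, clock skew, and the second redundant bus; the function names
are ours, and this is not an analysis taken from the Tandem design.

```python
def bus_utilization(processors, demand_per_processor):
    """Fraction of time the bus is busy, capped at 1.0 (saturation)."""
    return min(1.0, processors * demand_per_processor)


def relative_throughput(processors, demand_per_processor):
    """Work completed, in units of one unobstructed processor."""
    offered = processors * demand_per_processor
    if offered <= 1.0:
        return float(processors)
    # Beyond saturation the bus, not the processors, sets the ceiling.
    return processors / offered


for n in (2, 4, 8, 16):
    print(n, bus_utilization(n, 0.125), relative_throughput(n, 0.125))
```

With this demand the bus saturates at eight processors, and doubling the
processor count to sixteen adds no throughput in this model.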

The I/O is controlled by the communication controller. The controller can
assign a task (transaction) to any available processor. This can achieve
high availability and utilization of processors.

Perhaps the most appealing part of the design is the software. Guardian, the
operating system of NonStop, is a virtual memory system which contains
automatic re-entrant, recursive, and shareable code. Whenever a component
fails, Guardian automatically reassigns both processor and I/O resources to
ensure that in-process tasks, including file updates, are completed
correctly. This guarantees the process can be restarted in a very short
time. For a system that provides high availability, this type of action is
extremely important.

When  one  of  the  disks  fails  in  the  middle  of  a  file  update,  Enscribe, 
Tandem's  NonStop  data  base  manager,  ensures  that  the  damaged  record  or  file 
is  restored.  Enscribe  uses  a  duplicate  file  technique  to  continue  the 
operation  by  using  the  back-up  file.  Hence,  the  faulty  disk  will  not 
cause  any  interruption  of  service. 

Overall, NonStop uses redundant hardware and duplicated software in order to
give the user continued service without any interruption or termination.
This is very important to a system where the user cannot afford any system
downtime, for example, a telephone switching network or an online banking
system. However, this might not be a good candidate for a scientific
research environment.

1.2.4  Overall  Comparison 

We went through three multiprocessor systems very briefly in the last three
sections. As we can see, each system has a different architecture and its
own advantages and disadvantages. It is not fair to say which system is the
best, or which one is better than another, since they have different design
goals. However, they do provide us with a lot of interesting problems to
look at.

In the next section, we will list a number of problems we want to study.
Before we do that, we will give a comparison of these three systems in
Table 1. But first, let us give the definitions for the terms
"monoprogramming" and "multiprogramming" that we use in Table 1. These two
terms will be used very often in our later discussion.

In a single processor system, sometimes called a monoprocessing system, the
distinction between these two terms can be easily made. If at most one job
can be in execution in the system, i.e., the system only executes one job at
a time, it is in general called a monoprogrammed system. In fact, it should
be called a monoprocessing, monoprogrammed system. Many minicomputer systems
belong to this category. On the other hand, if more than one job can be in
execution at the same time, it is called a monoprocessing, multiprogrammed
system. Of course, the advantage of a multiprogrammed system is that the
processor can execute a job while the other jobs are doing I/O. This can
increase the processor utilization and in general shorten the turnaround
time for a job. Almost all the large systems operate in this mode, e.g., our
IBM 360/75 system.

In a multiple processor system, however, the situation is more complicated.
Of course, if the system still only allows one job in execution, i.e., all
the processors are working on one job, it is called a multiprocessing,
monoprogrammed system. A SIMD machine like ILLIAC IV belongs to this
category, where all the processors are under the control of one single
instruction stream. The C.mmp system sometimes also operates in this mode
when all the PDP 11's are running one giant artificial intelligence program.
In that case, it should be considered a multiprocessing, monoprogrammed
system. However, when more than one job is allowed in a multiprocessor
system, which is the case in the system we will deal with, the
classification is more complicated and sometimes rather confusing. For
example, it is difficult to classify a system where several processors are
executing one job while many other jobs are running on one other processor.
If we follow the way the other three kinds of systems are classified, this
kind of system should be called a multiprocessing, multiprogrammed system.
But multiprogramming is usually used for a system where each processor is
able to handle more than one job. It is rather unfair to call all the
multiprocessor systems that allow more than one job to be executed at the
same time multiprocessing, multiprogrammed systems.

Fortunately,  we  will  only  talk  about  this  kind  of  system,  i.e.,  a 
multiprocessor  system  that  allows  several  jobs  to  be  executed  at  the  same 
time.  So,  we  do  have  some  freedom  to  use  these  two  terms  and  can  avoid 
much  confusion. 

In the rest of this thesis, we will use monoprogramming to mean a system
that allows each processor to have at most one job at any given time. That
is, each processor will be dedicated to a certain job throughout the
lifetime of this job and cannot execute any other jobs. While a job is doing
I/O, its corresponding processor will be idling. From a processor's
standpoint, it seems that there is only one job in the memory. This is why
we use the term monoprogramming even though there might be more than one job
executing in the whole system at the same time. The PRIME system falls into
this category and hence will be called a monoprogrammed system. A
consequence of monoprogramming is that the number of jobs in execution will
never be larger than the number of processors in the system.

On the other hand, multiprogramming will be used to indicate a system that
allows more jobs than processors to be in the memory at the same time.
Therefore, each processor might be responsible for more than one job at a
given instant. In a multiprogrammed system, there need not be any link
between any job and any processor. All the jobs can be thought of as being
in a "pool." Whenever a processor is free, it will grab a job from the pool
to execute. The C.mmp system sometimes operates in this mode, so we classify
C.mmp as a multiprogramming system in Table 1.
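
The difference between the two modes can be made concrete with an idealized
utilization estimate. The sketch assumes that every job alternates a fixed
compute burst with a fixed I/O wait, that switching jobs costs nothing, and
that memory contention is absent; the function names and the burst lengths
are illustrative assumptions, not results from this thesis.

```python
def mono_utilization(compute, io):
    """Monoprogramming: the processor idles during its one job's I/O."""
    return compute / (compute + io)


def multi_utilization(jobs, processors, compute, io):
    """Multiprogramming, idealized upper bound: pooled jobs fill idle
    time, with no switching cost and no memory contention."""
    return min(1.0, jobs * compute / (processors * (compute + io)))


print(mono_utilization(3, 1))          # one job per processor
print(multi_utilization(4, 4, 3, 1))   # as many jobs as processors
print(multi_utilization(8, 4, 3, 1))   # twice as many jobs as processors
```

With 3 time units of computing per 1 unit of I/O, a dedicated processor is
busy 75% of the time. Pooling eight such jobs over four processors can, in
this idealized bound, keep every processor fully busy; the swapping
overhead ignored here is exactly what a real comparison must account for.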

Although all these systems have quite different software and hardware
structures, as we can see from Table 1, they all have the following
advantages: high availability, high hardware reliability, high flexibility,
relatively low cost, and ease of maintenance. These are the main design
objectives of multiprocessors. Most of all, they all provide very high
throughput due to concurrent operation of the processors.

However,  the  multiprocessors  also  have  some  disadvantages.  For 
example,  the  ability  to  expand  a  system  is  usually  limited  by  the  hardware 
capabilities  like  the  processor  speed,  bus  transfer  rate,  etc.  In  addition, 
the  system  software  is  rather  complicated,  which  makes  it  difficult  to 
design,  expensive  to  produce,  difficult  to  check  out,  and  difficult  to  main- 
tain [4,5].  The  design  of  a  simple,  self-checking  operating  system  should 
be  an  important  subject  in  multiprocessor  design.  We  will  not  go  into  this 
subject  here.  However,  we  do  want  to  study  the  performance  tradeoffs  of 
some  simple  strategies,  for  example,  monoprogramming  versus  multiprogramming. 
These results will be shown in Chapter 3.

1.3  Major  Design  Questions

In this section, we will describe all the questions we are interested in.
Hopefully, this can give the reader a complete idea of what we are after.
Although these questions cover only a subset of the problems a system
designer must be concerned about, we do believe they are basic and important
issues in designing a multiprocessor system. We will roughly subdivide these
questions into two groups, namely, software and hardware related questions,
and describe them separately.

1.3.1  Software  Related  Questions 

Almost  all  the  software  questions  we  list  here  affect  the  operating 
system  design.  As  we  just  mentioned,  a  simple  and  reliable  operating  sys- 
tem is  a  very   important  aspect  in  multiprocessor  design.  So,  our  atten- 
tion will  concentrate  on  the  simplicity  and  the  reliability  of  an  operating 
system. 

From  what  we  have  learned  in  the  previous  sections,  naturally,  our 
first  question  will  be: 

•  Should  we  use  multiprogramming  or  monoprogramming? 

Intuitively,  multiprogramming  can  result  in  higher  utilization  of 
both  processors  and  memories,  since  it  allows  more  jobs  to  be  packed  in  the 
memory  and  the  overlapping  of  processor  and  I/O  operations.  Of  course,  the 
busier  the  processors  are,  the  more  work  they  do,  and  the  higher  throughput 
they  produce. 

However, this does not necessarily mean the turnaround time of a job will be
better than under monoprogramming. There will be higher overhead due to
multiprogramming. For example, the system needs to save the current status of
the pending job in the memory and bring in the status information of the new
job. Moreover, resource contention will increase since more jobs are being
executed at the same time. All these will certainly increase the service time
of a job. Although multiprogramming may yield shorter queueing time, the total
turnaround time will not necessarily be shorter.

As  we  pointed  out  earlier,  the  biggest  disadvantage  of  multipro- 
gramming is  that  the  software  is  more  complex,  expensive,  and  fault-prone. 
Unless  it  can  outperform  monoprogramming  by  a  large  margin,  there  is  no 
reason  why  we  should  use  it.  In  Chapter  3,  we  will  measure  the  performance 
of  both  systems,  which  can  tell  us  which  system  is  better  to  use  in  a 
multiprocessor  system. 

Our  second  question  is  about  the  memory  allocation.  When  a  job 
comes  in,  we  can,  as  in  the  PRIME  system,  allocate  as  many  modules  as  it 
needs.  This  job  will  occupy  these  modules  exclusively,  and  no  sharing  is 
allowed  (Figure  5-a).  We  call  this  scheme  a  "partitioned"  scheme.  Of 
course, no interference between two jobs will occur, since jobs are
physically separated. Therefore, a partitioned system is very reliable and
provides high privacy.

One disadvantage is that a lot of memory space will be wasted. On the
average, each job will waste about half of a memory module if the job size is
uniformly distributed. This can be shown in a few lines. Assuming each memory
module is of size x and the job size is y, the unused part of the last module
allocated to a job is equally likely to be any of 0, 1, ..., x-1, so the
expected wasted memory is:

$$\frac{1}{x}\sum_{i=0}^{x-1} i \;=\; \frac{1}{x}\cdot\frac{x(x-1)}{2} \;=\; \frac{x-1}{2}.$$

If there are j jobs in the memory, then at least $j(x-1)/2$ memory space is
wasted on the average. Even more memory will be wasted if the number of
modules left over is not large enough for the next job. So, this scheme might
not be a good idea unless we can use small memory modules.

Figure 5. Memory Allocation Schemes. (a) Partitioned System.
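The half-module estimate can be checked with a few lines of Python. This is only a minimal sketch, not part of the original study: it assumes that the unused part of a job's last module is equally likely to be 0, 1, ..., x-1 units, and the function names `expected_waste` and `total_expected_waste` are hypothetical.

```python
from fractions import Fraction


def expected_waste(x: int) -> Fraction:
    """Mean unused space in a job's last module of size x.

    Assumption: the unused part is uniform on 0, 1, ..., x-1.
    """
    return Fraction(sum(range(x)), x)


def total_expected_waste(x: int, j: int) -> Fraction:
    """Lower bound on the mean wasted space when j jobs are resident."""
    return j * expected_waste(x)


if __name__ == "__main__":
    print(expected_waste(16))           # 15/2, i.e. (x - 1) / 2
    print(total_expected_waste(16, 4))  # 30
```

The exact rational arithmetic makes it easy to see that the result equals (x-1)/2 for every module size, so the waste per job shrinks in proportion to the module size.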

The  other  extreme,  as  in  C.mmp,  is  to  allow  the  jobs  to  share  all 
the  available  space.  One  way  to  share  memory  is  to  distribute  a  job  across 
all  the  memory  modules,  as  shown  in  Figure  5-b.  This  scheme  we  call  a 
"distributed"  scheme. 

In  a  distributed  system,  we  can  pack  as  many  jobs  in  the  memory 
as  possible,  and  each  job  can  enjoy  a  maximum  degree  of  interleaving.  How- 
ever, the  memory  contention  problem  arises  due  to  the  memory  sharing.  More- 
over, the  failure  of  any  single  module  can  cause  the  collapse  of  the  whole 
system. 

Of course, we can combine these two schemes and form a third scheme, which we
call a "mixed" scheme (as shown in Figure 5-c). Basically, this is a
partitioned system, except the "overflow" parts are put together in a single
module. Thus, we can pack more jobs in the memory, reduce the wasted space,
and still enjoy most of the advantages offered by a partitioned system. We
summarize the advantages and the disadvantages of these three schemes in
Table 2.

Now, our second question is:

•  What  kinds  of  performance  will  be  given  by  these 
schemes,  and  which  scheme  is  the  best  to  use? 

In  Chapter  3,  we  will  measure  all  these  schemes,  and  discuss  some 
problems  related  to  them,  for  example,  how  do  we  interleave  the  addresses 
if  we  want  to  use  the  partitioned  system  or  the  mixed  system. 
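To make the space trade-off between the partitioned and the mixed schemes concrete, the following sketch counts the modules each scheme needs for a given set of jobs. It is an illustration under stated assumptions, not the allocation algorithm measured in Chapter 3: job sizes and the module size x are integers in the same unit, the overflow parts are allowed to fill more than one shared module if necessary, and the function names are hypothetical.

```python
def modules_partitioned(job_sizes, x):
    """Modules used when every job owns whole modules exclusively."""
    return sum(-(-y // x) for y in job_sizes)  # ceil(y / x) for each job


def modules_mixed(job_sizes, x):
    """Modules used when only the overflow parts share modules.

    Assumption: overflow parts may spill into more than one shared module.
    """
    full = sum(y // x for y in job_sizes)      # modules filled completely
    overflow = sum(y % x for y in job_sizes)   # leftover parts packed together
    return full + -(-overflow // x)


def wasted(modules, job_sizes, x):
    """Allocated space that holds no job data."""
    return modules * x - sum(job_sizes)


if __name__ == "__main__":
    jobs, x = [5, 9, 14], 8
    for scheme in (modules_partitioned, modules_mixed):
        m = scheme(jobs, x)
        print(scheme.__name__, m, wasted(m, jobs, x))
```

For the three jobs above, the partitioned scheme needs 5 modules and wastes 12 units, while the mixed scheme needs 4 modules and wastes 4 units.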

The  next  question  we  are  interested  in  is: 

•  What  kind  of  scheduling  algorithm  should  we  use? 

Every  operating  system  designer  must  face  this  question.  In  order 
to  answer  it,  we  have  to  know  what  type  of  system  we  are  dealing  with, 
what  kind  of  measurement  we  are  interested  in,  and  what  penalty  we  will 
suffer  if  we  let  some  job  wait  in  the  queue.  In  general,  people  measure  the 
goodness  of  a  scheduling  algorithm  by  using  the  average  turnaround  time  it 
produces.  So,  most  people  use  round-robin  in  a  time-sharing  system,  shortest- 
job-first  in  a  batch  system,  and  give  higher  priority  to  time-sharing  jobs 
in  a  mixed  system. 

In our study, we will assume that we are dealing with a batch system.
However, this does not imply the shortest-job-first (SJF) algorithm will
always win if we consider the average turnaround time, since we will deal with
some special architectures like a partially connected system. Some other
algorithm, e.g., the best-memory-fit-first, might perform better than SJF
under that circumstance. We are even interested in seeing how badly the
first-come-first-serve (FCFS) algorithm will perform, since FCFS does not
involve any scheduling and that means a simple operating system.
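As a point of reference for the comparison between FCFS and SJF, the sketch below computes the average turnaround time of a small batch on a single processor. It is deliberately simplified and is not the simulator used in this thesis: all jobs are assumed to be present at time zero, there is one processor, and memory and I/O are ignored. The job lengths are made-up values.

```python
def average_turnaround(service_times):
    """Average completion time when jobs run back to back in the given order."""
    clock = 0.0
    total = 0.0
    for s in service_times:
        clock += s          # this job finishes at the current clock value
        total += clock
    return total / len(service_times)


if __name__ == "__main__":
    jobs = [8.0, 1.0, 3.0, 2.0]                 # hypothetical job lengths
    print(average_turnaround(jobs))             # FCFS order: 10.75
    print(average_turnaround(sorted(jobs)))     # SJF order:  6.0
```

In this idealized setting SJF is optimal; the interesting question in the text is whether that advantage survives once memory fit and partial connection restrict which job can actually start.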

The  fourth  question  we  want  to  answer  is: 

•  How  do  the  job  characteristics  affect  the  system
performance?

The job characteristics include the mean, the variance, and the distribution
of the job size, the processing time, the inter-arrival time, and the number
of I/O requests. These parameters certainly have great influence on the
system performance. For example, if the mean job inter-arrival time is too
small, i.e., the jobs come in too fast, the system will become
over-saturated and the queueing time might go to infinity.
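The saturation effect can be illustrated with the textbook single-server M/M/1 queue. This is used here purely as an example and is much simpler than the multiprocessor model studied later; the function name and the numerical values are assumptions made for the illustration.

```python
import math


def mm1_mean_queueing_time(mean_interarrival: float, mean_service: float) -> float:
    """Mean time a job waits in queue in an M/M/1 system.

    Returns infinity when jobs arrive at least as fast as they can be served.
    """
    lam = 1.0 / mean_interarrival   # arrival rate
    mu = 1.0 / mean_service         # service rate
    if lam >= mu:
        return math.inf
    return lam / (mu * (mu - lam))


if __name__ == "__main__":
    for gap in (2.0, 1.25, 1.05, 1.0):
        print(gap, mm1_mean_queueing_time(gap, 1.0))
```

With a mean service time of 1, the waiting time grows without bound as the mean inter-arrival time approaches 1 from above, which is the behaviour described in the text.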

In  order  to  avoid  all  these  undesirable  phenomena,  we  must  under- 
stand how  the  system  responds  when  a  certain  parameter  changes,  and  how 
sensitive  it  is.  Then,  we  can  always  keep  our  system  in  a  safe  region. 
Whenever  the  system  load  changes,  we  will  know  what  we  should  do  in  order 
to  maintain  satisfactory  performance. 

1.3.2  Hardware  Related  Questions 

The  cost-effectiveness  is  perhaps  the  subject  people  are  most  con- 
cerned about  in  system  design.  Every  designer  wants  to  know  how  much  per- 
formance improvement  he  can  get  for  a  certain  piece  of  hardware  he  adds. 
Of  course,  everyone  will  try  to  invest  his  money  where  he  can  buy  the  best 
improvement.  So,  this  trade-off  problem  usually  is  the  first  thing  people 
will  solve. 

Like  most  people,  our  first  hardware  related  question  is: 

•  How  is  the  system  performance  affected  by  the 
architectural  parameters? 

We want to know how the job turnaround time and the processor and memory
utilizations vary when we change certain hardware parameters, like the number
of processors, the number of memory modules, the number of I/O devices, or
the size of a memory module. We also want to know how the system performance
will be affected by the hardware characteristics, like the memory cycle time,
the processor cycle time, and the I/O speed.

As shown in Table 1, the architectural difference between those three
multiprocessor systems lies in the interconnection schemes they use, namely,
the crossbar switch, the multiport memories, and the time-shared common
buses. In fact, we can classify them into two groups according to the degree
of connection each scheme provides. The crossbar switch used in C.mmp will be
called a "full connection," since every processor can access any memory
module via this switch. The multiport memories used in PRIME or the common
buses used in NonStop, on the other hand, will be called a "partial
connection," since every processor can only access part of the memory.

Naturally,  we  would  like  to  know: 

•  How  much  degradation  will  we  suffer  if  we  use  a  partial 
connection  instead  of  a  full  connection? 

Of  course,  the  degradation  goes  up  as  we  decrease  the  number  of 
connection  points.  But  at  the  same  time,  we  reduce  the  cost  of  the  whole 
system.  Obviously,  this  is  a  trade-off  problem.  We  will  do  some  compari- 
sons in  Chapter  3. 

There is another very interesting problem associated with the partial
connection, which we call the "connectivity" problem. In a partial
connection, say multiport memories, each memory module is connected to only
some of the processors, so the question is how many memory modules should be
connected to a particular processor. For example, in Figure 1, each memory
module has four ports (only three of which are used, except for one module
that uses all four), and each processor is connected to eight memory modules
in a fairly regular manner. However, this uniform connection might not be the
best way. Especially when the number of ports is small relative to the number
of processors, some uneven connection might be necessary in order to meet
certain requirements; for example, one processor may have to be connected to
half of the memory modules in order to take care of big jobs. Again, we will
devote a section to this interesting problem.

1.4  Thesis  Outline

In  order  to  answer  the  questions  we  raised  in  the  last  section, 
we  need  to  do  some  performance  measurements.  Two  methods  are  commonly  used 
by  people  for  measuring  the  system  performance,  namely,  queueing  analysis 
and  simulation.  Usually,  queueing  analysis  can  reveal  more  insight 
about  the  system  behavior,  since  we  can  see  from  the  analytic  solution  how 
a  certain  variable  affects  the  system  performance.  In  the  first  part  of 
Chapter  2,  we  will  discuss  some  analytic  work  people  have  done  in  measuring 
computer  systems. 

However, queueing techniques are only good for simple models with some simple
assumptions. As the complexity of a system model increases, the queueing
analysis will soon become intractable. Therefore, people switch to
simulation. The nice thing about a simulation model is that you can put in as
many parameters as you want, and as many constraints as you like. So, the
simulation technique can be applied to very complicated models.
Unfortunately, it can be very costly.

Since our model is rather complicated, we will use a combination of the
analytic approach and simulation to measure the relative performance of
various systems. In the second part of Chapter 2, we will describe the
simulation model we use. We will also talk about some memory bandwidth
problems, since we will use memory bandwidth to determine how we advance the
virtual clock of our simulator.

In  Chapter  3,  we  will  present  all  the  results  and  try  to  answer 
the  questions  we  raised  in  the  last  section.  Finally,  in  Chapter  4,  we  will 
discuss some logic design problems and give a summary of all our results.


Chapter  2
SYSTEM  PERFORMANCE  MEASUREMENT 


2.1  Queueing  Analysis 

In  this  chapter,  we  are  going  to  talk  about  how  we  measure  the 
performance  of  a  multiprocessor  system.  As  we  just  mentioned  at  the  end  of 
the  last  chapter,  two  common  techniques  can  be  used  for  this  purpose: 
queueing  analysis  and  simulation.  We  will  start  by  looking  at  some  queueing 
models  that  have  been  proposed  for  analysis  of  multiprocessor  systems. 

Using  queueing  techniques  to  study  the  system  performance  is  a 
very  old  subject.  People  have  been  active  in  this  area  for  quite  a  number 
of  years  and  a  lot  of  papers  have  been  written  on  this  subject.  However, 
most  of  the  effort  has  been  spent  in  the  following  areas: 

(1)  Performance  analysis  of  auxiliary  and  buffer  storage  like 
disk  [21,22],  drum  [23],  or  magnetic-bubble  [24]. 

(2)  Waiting  time  analysis  of  job  scheduling  disciplines 
[25,26]. 

(3)  Performance  analysis  of  single  processor  multiprogram- 
ming systems  [27]  and  single  processor  time-sharing 
systems  [28]. 

(4)  Performance  study  of  communication  networks  like 
ARPANET  [26]  or  ALOHA  [29]. 

Relatively few papers have been written about multiprocessor systems.
Perhaps the biggest difficulty is to formulate the resource contentions into
the model. Besides, if we want to consider the finiteness of memory size and
I/O operations, then the whole system will become a queueing network with
blocking, which is well known to be hard to solve exactly. In some papers,
e.g., [30], people just ignore all these problems and treat the
multiprocessor as an M/M/p queueing system, which yields a bad approximation.
Of course, we are interested in a more accurate solution.

Recently, three papers have been written which provide some analytic methods
of studying multiprocessor systems [31,32,33]. In two of these methods, we
can accurately include the effects of finite memory size and workload memory
requirements in the queueing model. We feel that these papers deserve to be
discussed in some detail so that readers can better understand the problem
and appreciate the strength of these analytic methods. We will show how we
apply these methods to our queueing problem, and discuss the advantages and
disadvantages of each method. However, due to the complexity of the systems
we are going to study, we will have some difficulty including all the effects
of system architecture and resource contention in a queueing model. Unless we
can nicely formulate everything, we will not be able to get accurate results
from any of these methods. After we discuss these three papers, the reader
should realize why we will rely on simulation rather than queueing analysis.

Before  we  talk  about  these  analytic  models,  let  us  first  describe 
our  queueing  model  for  a  multiprocessor  system.  This  will  aid  in  under- 
standing the  later  discussion. 

2.1.1  Our  Queueing  Model 

Figure 6 is the basic queueing model we are interested in. We assume that the
system has p processors, r I/O devices, and a total of M kbytes of primary
memory divided into m modules. When a job arrives, if either there are
already D jobs in the system or the available memory is not big enough, it
will be queued in the outside waiting queue. Otherwise, this job will enter
the service box and queue in the processor queue for the first service. If
the job gets a processor, it will be served for some amount of time
nonpreemptively until it requests an I/O operation. Then it proceeds to the
I/O queue and waits for an I/O operation. After the I/O, this job will depart
the whole system with probability $1-\sigma$, or return to the processor
queue with probability $\sigma$, and the cycle starts again.

The  outside  waiting  queue  corresponds  to  the  HASP  queue  in  our 
IBM  360/75  which  holds  the  jobs  that  are  blocked  from  service.  The  average 
number  of  jobs  queued  here  is  an  indication  of  how  well  the  system  performs. 
A  good  scheduling  algorithm  could  be  used  here  to  reduce  the  average  queue 
length. 

The  number  D  indicates  the  maximum  number  of  jobs  allowed  in  the 
service  box.  It  is  equal  to  p  under  monoprogramming  and  some  constant  d 
under  multiprogramming  (d  is  called  degree  of  multiprogramming). 

Each job in the service box will cycle through the tandem queue a certain
number of times. This number will have a geometric distribution:

$$\Pr\{\text{a job needs } \ell \text{ cycles}\} = (1-\sigma)\,\sigma^{\ell-1}, \qquad \ell = 1, 2, \ldots$$

with mean $\bar{a} = 1/(1-\sigma)$, where $\sigma$ is the probability that a
job returns to the processor queue after an I/O operation. $\bar{a}$ is
actually the average number of I/O requests for a job. The parameter $\sigma$
can be arbitrarily defined or obtained from analyzing some real data.
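The relation between the feedback probability and the mean number of cycles can be checked numerically. The sketch below is only an illustration; the names `cycle_probability`, `mean_cycles`, and `sample_cycles` and the value 0.75 are assumptions chosen for the example.

```python
import random


def cycle_probability(cycles: int, sigma: float) -> float:
    """Pr{a job needs exactly `cycles` processor-I/O cycles}."""
    return (1.0 - sigma) * sigma ** (cycles - 1)


def mean_cycles(sigma: float) -> float:
    """Mean of the geometric distribution above."""
    return 1.0 / (1.0 - sigma)


def sample_cycles(sigma: float, rng: random.Random) -> int:
    """Draw the number of cycles by repeating the feedback decision."""
    cycles = 1
    while rng.random() < sigma:   # job returns to the processor queue
        cycles += 1
    return cycles


if __name__ == "__main__":
    rng = random.Random(1)
    draws = [sample_cycles(0.75, rng) for _ in range(100_000)]
    print(sum(draws) / len(draws), mean_cycles(0.75))
```

The sample mean should settle close to 1/(1-0.75) = 4, which is how a measured average number of I/O requests per job can be turned into a value for the feedback probability.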

Figure  7  shows  the  timing  diagram  of  a  job  from  its  arrival  until 
its  departure.  Of  course,  in  any  stage,  if  the  resource  is  available  when 
the  job  arrives,  it  will  get  served  immediately  without  waiting. 

In order to make analysis easier, people always assume the job arrival is a
Poisson process. In other words, the job arrival rate is constant, or the
inter-arrival time is exponentially distributed. Literature on queueing
analysis has shown evidence that this is a pretty acceptable assumption.

Figure 7. History of a Job.

However, the most controversial part is the service rates of processor and
I/O, i.e., $\mu_1$ and $\mu_2$ in Figure 6. In order to use the nice results
of queueing theory, we will have to assume they are constant, that is, to
assume both the processing time and the I/O time are exponentially
distributed. This is a very strong assumption to make.

For the I/O service rate, if we neglect the interference between I/O
requests, this assumption may be all right. But this cannot be true for the
processor. The service rate of a processor should be a function of system
architecture, memory allocation strategy, the number of jobs currently being
executed, the memory interference they create, and the original inter-I/O
time distribution. In general, this is not easy to formulate.

In  addition,  due  to  the  finiteness  of  the  memory  size  and  the 
maximum  number  of  jobs  allowed  in  the  system,  this  model  becomes  a  queueing 
system  with  blocking.  It  is  a  tough  problem  and  no  exact  solution  is  known 
yet  [34].  The  best  thing  people  can  do  is  to  use  an  approximate  model.  One 
example  is  Avi-Itzhak  and  Heyman's  model  which  we  are  going  to  discuss  in 
the  next  section. 

2.1.2  Avi-Itzhak  and  Heyman's  Method 

The approach adopted by Avi-Itzhak and Heyman consists of two stages [31].
The first stage is to view the system as a closed queueing network with a
fixed number of jobs, and obtain the average cycle time for each job. Then we
approximate the open system by an M/G/D queue, use the result of stage one to
solve the state balance equations, and get the expected time a job will spend
in the system.

By a closed queueing network, we mean a network with no job coming in or
going out. Figure 8 shows the closed queueing network used in Avi-Itzhak and
Heyman's analysis. It consists of k service stations with a fixed number, say
n, of jobs cycling through the stations. Station 1 represents a group of
processors, and stations 2 to k represent various groups of peripheral
devices such as disks, tapes, etc. Station i contains $r_i$ servers operating
in parallel with a common queue, and each server has the same expected
service time $E(S_i)$. When service at a processor is completed, the job
moves to station i with probability $\pi_i$, where $\pi_1 = 0$ and
$\sum_{i=2}^{k} \pi_i = 1$. Upon completion of service at station i, the job
moves back to station 1 and the same process is repeated.

The  exact  solution  of  this  queueing  model  has  been  obtained  by 
Jackson  [35],  and  Gordon  and  Newell  [36].  We  will  repeat  their  result  here 
and  show  how  to  relate  it  to  our  model. 

Let $\rho_i$ be the steady-state expected number of busy servers at station
i. The average number of jobs flowing into a given station must equal the
average flow out of the station. Therefore,

$$[\rho_1 / E(S_1)]\,\pi_i = \rho_i / E(S_i), \qquad i = 2, \ldots, k.$$

Define

$$a_i = \pi_i\,[E(S_i)/E(S_1)],$$

we obtain

$$\rho_i = a_i\,\rho_1,$$

with $a_1 = 1$ by definition.

Assume $p(x_1, x_2, \ldots, x_k)$ is the steady-state joint probability of
there being $x_i$, $i = 1, 2, \ldots, k$, jobs at station i, then we have

$$p(x_1, x_2, \ldots, x_k) = c \prod_{i=1}^{k} \left[ \rho_i^{x_i} / \beta_i(x_i) \right]. \qquad (1)$$

In this equation all $x_i \ge 0$, $\sum_{i=1}^{k} x_i = n$, c is the
normalization constant, and

$$\beta_i(x_i) = \begin{cases} x_i! & \text{if } x_i \le r_i, \\ r_i!\; r_i^{\,x_i - r_i} & \text{if } x_i > r_i. \end{cases}$$

The summation of $p(x_1, x_2, \ldots, x_k)$ over the set
$D_n = \{x : x_i \ge 0,\ \sum_{i=1}^{k} x_i = n\}$ must yield the value 1.
Therefore from equation (1), we have

$$1 = c \sum_{D_n} \prod_{i=1}^{k} \left[ \rho_i^{x_i} / \beta_i(x_i) \right] = c\,\rho_1^{\,n} \sum_{D_n} \prod_{i=1}^{k} \left[ a_i^{x_i} / \beta_i(x_i) \right]. \qquad (2)$$

Figure 8. The Closed Queueing Network used in Stage One.

Figure 9. Approximated M/G/D Queueing Model.

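However, the expected number of busy servers at station 1 is $\rho_1$, that
is

$$\rho_1 = \sum_{x_1=1}^{r_1-1} x_1 \sum_{D_{n-x_1}} p(x_1, x_2, \ldots, x_k) \;+\; r_1 \sum_{x_1=r_1}^{n} \sum_{D_{n-x_1}} p(x_1, x_2, \ldots, x_k), \qquad (3)$$

where $D_{n-x_1} = \{x : x_i \ge 0,\ \sum_{i=2}^{k} x_i = n - x_1\}$.
Substitution of (1) and (2) into (3) yields

$$\rho_1 = \left\{ \sum_{x_1=1}^{r_1-1} x_1 \sum_{D_{n-x_1}} A \;+\; r_1 \sum_{x_1=r_1}^{n} \sum_{D_{n-x_1}} A \right\} \Bigg/ \sum_{D_n} A$$

with $A = \prod_{i=1}^{k} \left[ a_i^{x_i} / \beta_i(x_i) \right]$.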

The only thing we need now is an algorithm to generate all the elements in
$D_n$ and $D_{n-x_1}$; then we can easily enumerate $\rho_1$. After obtaining
$\rho_1$, we can get the average inter-arrival time at station 1, i.e.,
$E(S_1)/\rho_1$, by applying Little's theorem. Since there are n jobs in the
system, the average cycle time for each job, i.e., the time between two
visits to station 1 by each job, should be

$$T(n) = n\,E(S_1) / \rho_1.$$

This is the result we want in stage one.
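The stage-one computation can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions rather than the program used in this thesis: the tuple `a` holds the ratios a_1, ..., a_k with a_1 = 1, the tuple `r` holds the number of servers at each station, `es1` stands for E(S_1), and all function names are hypothetical. The set D_n is generated by simple recursion.

```python
from math import factorial


def beta(x, r):
    """beta_i(x_i): x! if x <= r, otherwise r! * r**(x - r)."""
    return factorial(x) if x <= r else factorial(r) * r ** (x - r)


def compositions(n, k):
    """Yield every tuple of k non-negative integers summing to n (the set D_n)."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def busy_processors(n, a, r):
    """Expected number of busy servers at station 1 with n jobs circulating."""
    numerator = 0.0
    denominator = 0.0
    for x in compositions(n, len(a)):
        weight = 1.0                      # the product called A in the text
        for xi, ai, ri in zip(x, a, r):
            weight *= ai ** xi / beta(xi, ri)
        denominator += weight
        numerator += min(x[0], r[0]) * weight   # x_1 below r_1, else r_1
    return numerator / denominator


def cycle_time(n, a, r, es1):
    """T(n) = n * E(S_1) / rho_1."""
    return n * es1 / busy_processors(n, a, r)


if __name__ == "__main__":
    # One processor, one I/O device, one job: T(1) = E(S_1) + E(S_2).
    print(cycle_time(1, (1.0, 3.0), (1, 1), 2.0))
```

With a single job there is never any queueing, so the cycle time reduces to the sum of the two mean service times, which gives a quick sanity check on the enumeration.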

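Now, let us come back to the original open system. Avi-Itzhak and Heyman
propose to view the open computer system as an M/G/D queue, as shown in
Figure 9, with arrival rate $\lambda$, D servers, and state-dependent service
rate $\mu_n$. If each job takes $\bar{a}$ cycles on the average (the mean
number of cycles per job defined in Section 2.1.1), then clearly,

$$\mu_n = \begin{cases} n / [\bar{a}\,T(n)], & n = 1, \ldots, D, \\ D / [\bar{a}\,T(D)], & n > D, \end{cases}$$

which is the rate at which jobs depart from the system.

Denoting the steady-state probability of having n jobs in the system by
$P_n$, we can have the following set of balance equations:

$$\lambda_0 P_0 = \mu_1 P_1,$$

$$(\lambda_n + \mu_n)\,P_n = \lambda_{n-1} P_{n-1} + \mu_{n+1} P_{n+1}, \qquad n = 1, 2, \ldots$$

The solution to this set of equations is

$$P_n = P_0 \prod_{i=1}^{n} \frac{\lambda_{i-1}}{\mu_i}, \quad n = 1, 2, \ldots, \qquad \text{with} \quad \sum_{n=0}^{\infty} P_n = 1.$$

If we assume $\lambda < \mu_D$ and $\lambda_0 = \lambda_1 = \cdots = \lambda$,
we get

$$P_n = \begin{cases} P_0\,\lambda^n / (\mu_1 \mu_2 \cdots \mu_n), & n = 1, 2, \ldots, D, \\ P_D\,(\lambda/\mu_D)^{\,n-D}, & n > D. \end{cases}$$

By Little's Theorem, we obtain the expected time a job spends in the whole
system

$$E(T_s) = \frac{1}{\lambda} \sum_{n=0}^{\infty} n\,P_n.$$

This is the final result we are looking for.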
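Stage two can be sketched in the same way. The code below is a minimal illustration, assuming a constant arrival rate and a list `cycle_times` whose entry n-1 holds T(n) for n = 1, ..., D; the function name and argument names are hypothetical. The infinite tail for n > D is geometric, so it is summed in closed form.

```python
def expected_time_in_system(lam, mean_cycles, cycle_times):
    """Expected time a job spends in the system under the M/G/D approximation.

    lam          arrival rate
    mean_cycles  average number of cycles per job
    cycle_times  [T(1), ..., T(D)] from stage one
    """
    D = len(cycle_times)
    # mu[n] is the departure rate with n jobs present, n = 1..D.
    mu = [0.0] + [n / (mean_cycles * cycle_times[n - 1]) for n in range(1, D + 1)]
    q = lam / mu[D]
    if q >= 1.0:
        raise ValueError("unstable: the arrival rate must be below mu_D")

    # Unnormalized weights P_n / P_0 for n = 0..D.
    weights = [1.0]
    for n in range(1, D + 1):
        weights.append(weights[-1] * lam / mu[n])

    # Geometric tail for n > D: P_n / P_0 = weights[D] * q**(n - D).
    tail_mass = weights[D] * q / (1.0 - q)
    tail_jobs = weights[D] * (D * q / (1.0 - q) + q / (1.0 - q) ** 2)

    total = sum(weights) + tail_mass
    mean_jobs = (sum(n * w for n, w in enumerate(weights)) + tail_jobs) / total
    return mean_jobs / lam            # Little's theorem


if __name__ == "__main__":
    # D = 1 with T(1) = 1 and one cycle per job is an M/M/1 queue with mu = 1.
    print(expected_time_in_system(0.5, 1.0, [1.0]))
```

With D = 1 the model collapses to an M/M/1 queue, for which the mean time in system is 1/(mu - lambda); this gives a simple check before the function is fed with cycle times obtained from the closed network.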

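For our queueing model, the calculation in stage one is much simpler since we
only consider two stations (k=2). Therefore, there are only n+1 elements in
$D_n$ and one in $D_{n-x_1}$, i.e., $x_2 = n - x_1$. The equation for
$\rho_1$ then becomes:

$$\rho_1 = \frac{\displaystyle \sum_{x_1=1}^{r_1-1} x_1\,\frac{a_2^{\,n-x_1}}{\beta_1(x_1)\,\beta_2(n-x_1)} \;+\; r_1 \sum_{x_1=r_1}^{n} \frac{a_2^{\,n-x_1}}{\beta_1(x_1)\,\beta_2(n-x_1)}}{\displaystyle \sum_{x_1=0}^{n} \frac{a_2^{\,n-x_1}}{\beta_1(x_1)\,\beta_2(n-x_1)}}.$$

If we assume monoprogramming, i.e., $n \le r_1 = p$, the above equation can
be further reduced to:

$$\rho_1 = \sum_{x_1=1}^{n} x_1\,\frac{a_2^{\,n-x_1}}{x_1!\;\beta_2(n-x_1)} \Bigg/ \sum_{x_1=0}^{n} \frac{a_2^{\,n-x_1}}{x_1!\;\beta_2(n-x_1)}.$$

The rest of the calculation remains the same.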

The advantage of this method is, obviously, the ease of enumeration. However,
there are several problems with this model. First, we do not know how
accurate it is to approximate an open computer system by an M/G/D queue.
Second, this model does not consider the effects of finite memory size and
things like the scheduling algorithm. Third, and most important, we do not
know the value of $a_2$ (the ratio of service rates).

Of course, we can figure out how $a_2$ affects the performance. However, we
do not know how the system architecture, memory allocation strategy, and
other factors affect the value of $a_2$. An analytic formula is very
difficult to get. One way, perhaps the only feasible way, is to get this
value by simulation. However, if we must simulate to find $a_2$, we might as
well use simulation from the beginning.

2.1.3  Konheim  and  Reiser's  Method 

Figure 10 shows the queueing model studied and solved by Konheim and Reiser
[32]. It is a two-stage tandem queue with feedback and a finite intermediate
waiting queue. The arrival of jobs is a Poisson process of rate $\lambda$,
and service times in the two stages are exponentially distributed with rates
$\alpha$ and $\beta$, respectively. The principal characteristic of the
service in this network is blocking. The first-stage server is blocked and
ceases to offer service whenever M jobs are enqueued in the second stage. In
other words, stage two cannot accumulate more than M jobs. Service resumes in
the first stage when the number of jobs in the second stage falls to M-1. As
usual, when a job leaves the second server, it will depart from the system
with probability $1-\sigma$ or rejoin the first queue with probability
$\sigma$.

This  queueing  model  is,  of  course,  quite  different  from  our  model 
shown  in  Figure  6.  The  only  thing  we  are  interested  in  is  the  method 
Konheim  and  Reiser  use,  which  we  can  apply  to  our  problem  with  a  little 
modification. 

In fact, the method they use is very basic, namely, using the state balance
equations. For the queueing network shown in Figure 10, they come out with
the following equation:

$$
\begin{aligned}
\left[\lambda + \alpha\,I_{(i>0,\,j<M)} + \beta\,I_{(j>0)}\right] P_{i,j} ={}& \lambda\,I_{(i>0)}\,P_{i-1,j} \\
&+ \beta\sigma\,I_{(i>0,\,j<M)}\,P_{i-1,j+1} \\
&+ \beta(1-\sigma)\,I_{(j<M)}\,P_{i,j+1} \\
&+ \alpha\,I_{(j>0)}\,P_{i+1,j-1}, \qquad 0 \le i < \infty,\ 0 \le j \le M.
\end{aligned}
$$

Here $\lambda$ is the arrival rate, $\alpha$ and $\beta$ are the service
rates of the two stages, and $\sigma$ is the feedback probability. $P_{i,j}$
is the steady-state probability that there are i jobs in the first stage and
j jobs in the second stage. $I_A$ is the indicator function of the condition
A, which will be one if the condition holds and zero otherwise. The left-hand
side of the equation shows the possibilities that will cause the system to
leave the state (i, j), and the right-hand side contains the possibilities
that will cause the system to enter this state.

Figure 10. A Tandem Queue with Finite Waiting Room and Blocking.
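The structure of such a balance equation is easiest to see by listing the transitions out of a state. The sketch below does this for the two-stage tandem queue with blocking; it is an illustration only, with hypothetical function and parameter names (`lam` for the arrival rate, `alpha` and `beta` for the two service rates, `sigma` for the feedback probability, `M` for the capacity of the second stage).

```python
def transitions(i, j, lam, alpha, beta, sigma, M):
    """Return {next_state: rate} for state (i, j) of the tandem queue.

    i: jobs in the first stage, j: jobs in the second stage (0 <= j <= M).
    """
    rates = {(i + 1, j): lam}                      # a new job arrives
    if i > 0 and j < M:                            # stage one is not blocked
        rates[(i - 1, j + 1)] = alpha              # stage-one completion
    if j > 0:
        rates[(i, j - 1)] = beta * (1.0 - sigma)   # job departs the system
        rates[(i + 1, j - 1)] = beta * sigma       # job rejoins the first queue
    return rates


def exit_rate(i, j, lam, alpha, beta, sigma, M):
    """Total rate of leaving state (i, j): the left-hand coefficient."""
    return sum(transitions(i, j, lam, alpha, beta, sigma, M).values())


if __name__ == "__main__":
    params = dict(lam=0.3, alpha=1.0, beta=2.0, sigma=0.25, M=3)
    print(exit_rate(2, 1, **params))   # arrival, stage one, and stage two all active
    print(exit_rate(2, 3, **params))   # stage one is blocked because j = M
```

Summing the rates out of a state reproduces the bracket on the left-hand side of the balance equation, and collecting the neighbours whose transitions lead into (i, j) reproduces the four terms on the right-hand side.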

Although  the  result  Konheim  and  Reiser  obtain  from  their  analysis 
cannot  be  applied  to  our  model,  we  certainly  can  use  the  method  described 
above,  except  we  have  to  increase  the  number  of  indices  to  three  since 
our  model  has  three  stages. 

Let us recall our model in Figure 6. Let $P_{ijk}$ be the steady-state
probability that there are $i$ jobs in the waiting queue, $j$ jobs in the
processor stage, and $k$ jobs in the I/O stage. We then have the following
balance equation, which is similar to Konheim and Reiser's equation.

$$
\begin{aligned}
[\lambda + f(j) + g(k)]\,P_{ijk}
  ={}& P_{i(j+1)(k-1)}\,f(j+1) \\
     &+ P_{i(j-1)(k+1)}\,g(k+1)\,\sigma \\
     &+ P_{(i-1)jk}\,\lambda\,\Pr\{\text{new arrival cannot enter} \mid \text{system is in state } (i-1,j,k)\} \\
     &+ P_{i(j-1)k}\,\lambda\,\Pr\{\text{new arrival can enter} \mid \text{system is in state } (i,j-1,k)\} \\
     &+ P_{ij(k+1)}\,g(k+1)(1-\sigma)\,\Pr\{\text{no job can enter} \mid \text{system is in state } (i,j,k+1)\} \\
     &+ P_{(i+1)(j-1)(k+1)}\,g(k+1)(1-\sigma)\,\Pr\{\text{a job can enter} \mid \text{system is in state } (i+1,j-1,k+1)\}
\end{aligned}
\tag{4}
$$

for $i \ge 0$, $D \ge j+k \ge 0$.

where $f(j)$ is the total service rate of the processors when $j$ jobs are
in the processor stage and, similarly, $g(k)$ is the total service rate of
the I/O devices when $k$ jobs are in the I/O stage. We drop the indicator
functions from the equation; any term on the right-hand side vanishes if
it has a negative index.

The calculation of those conditional probabilities is a very
interesting subject. It depends on several things, for example, the job
scheduling algorithm, the memory allocation strategy, the total memory
size (M), and the job size distribution. Of course, it also depends on the
state $(i,j,k)$. Perhaps the best way to explain this is to give an example.

Let us look at the first conditional probability, namely, the
probability that the new arrival cannot enter the memory, given that the
system is in state $(i-1,j,k)$. If we assume a first-come-first-serve
(FCFS) scheduling algorithm, then, trivially, this probability is 1
when there is at least one job in the waiting queue, i.e., $i-1 \ge 1$ or
$i \ge 2$. However, when $i=1$, i.e., the system is in state $(0,j,k)$, this
probability becomes the probability that the new job cannot enter the
memory given that there are already $j+k$ jobs in it. To answer this
question, we must know the last three things we mentioned in the previous
paragraph. Let us assume we use distributed memory allocation (cf.
Chapter 1), and the job size is normally distributed with mean $\mu$ and
variance $v^2$; then the probability will be:

$$
\Phi\!\left(\frac{M-(j+k)\mu}{\sqrt{j+k}\,v}\right)
 - \Phi\!\left(\frac{M-(j+k+1)\mu}{\sqrt{j+k+1}\,v}\right),
\qquad\text{where}\quad
\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt.
$$
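As a minimal numerical sketch of the expression above, the probability can be
evaluated with the standard error function. The function names `phi`,
`prob_fit` and `prob_cannot_enter` and the sample parameter values are
illustrative assumptions, not part of the original analysis. The case of an
empty memory, where the formula would divide by zero, is handled here by
assuming that zero jobs always fit.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def prob_fit(n, M, mu, v):
    """P(total size of n jobs <= M) for i.i.d. normal(mu, v^2) job sizes."""
    if n == 0:
        return 1.0  # assumption: an empty memory always "fits"
    return phi((M - n * mu) / (sqrt(n) * v))

def prob_cannot_enter(j, k, M, mu, v):
    """Probability that j+k jobs fit in memory but one more job does not."""
    n = j + k
    return prob_fit(n, M, mu, v) - prob_fit(n + 1, M, mu, v)

# Hypothetical numbers: memory of 100 units, mean job size 10, std. dev. 2.
p = prob_cannot_enter(5, 4, M=100, mu=10, v=2)
```

With these sample values the ten-job total has mean exactly M, so the second
term equals 1/2.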

As we can see, the terms on the right-hand side may take quite
different forms from equation to equation, since they depend heavily
on the state and other factors. In general, it is impossible to solve
equation (4) analytically by using a transformation technique. This is the
same problem Konheim and Reiser encountered, although their equation is much
simpler than ours. If we only want to solve for the state probabilities, then
we can use the numerical method Konheim and Reiser use for their model. The
method is very straightforward. However, it requires two huge arrays.

If we let $P = [P_{ijk}]$ be the state probability "vector," then we
can rewrite equation (4) in the following form:

P  =  PA 

where  A  is  the  state  transition  matrix.  Each  entry  in  A  can  be  enumerated 
by  the  method  we  just  described. 

Figure 11 is a three-dimensional representation of equation (4).
We can see that equation (4) forms an irreducible Markov chain, since every
state can be reached from any other state. For example, from state
$(i,j,k-1)$ we can go to state $(i,j,k)$ via state $(i,j-1,k)$. Only the
direct transitions to and from state $(i,j,k)$ are shown in Figure 11. The
number attached to each arc represents the corresponding state transition
term on the right-hand side of equation (4).

Figure 11. The Irreducible Markov Chain. (Only transitions to and from
state $(i,j,k)$ are shown.)

After getting the state transition matrix A, we can write a
simple program to calculate the state probabilities. Here is an algorithm
we can follow.

Step 1. Initialize P: set $P_{000}=1$ and $P_{ijk}=0$ for all other
elements. Also, set n=0.

Step 2. $T \leftarrow PA$, n = n+1.

Step 3. Is n ≤ Limit? If not, stop, and check all parameters.

Step 4. Is $T \approx P$? If not, $P \leftarrow T$ and go back to Step 2.

Step 5. (Obtain steady-state probabilities) Calculate
everything we want.
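The five steps can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the states $(i,j,k)$ are assumed to
have been flattened into a single index with the empty state first, A is
assumed to be a row-stochastic matrix given as a list of rows, and the names
`steady_state`, `limit` and `tol` are ours. The three-state matrix in the
usage lines is a toy chain, not the matrix of equation (4).

```python
def steady_state(A, limit=10_000, tol=1e-12):
    """Iterate P <- P A until two successive vectors agree to within tol."""
    size = len(A)
    P = [0.0] * size
    P[0] = 1.0                       # Step 1: all mass on the empty state
    for n in range(1, limit + 1):    # Step 3: give up after `limit` passes
        # Step 2: T = P A (row vector times matrix)
        T = [sum(P[r] * A[r][c] for r in range(size)) for c in range(size)]
        # Step 4: converged when T and P are (almost) equal
        if max(abs(t - p) for t, p in zip(T, P)) < tol:
            return T, n              # Step 5: steady-state probabilities
        P = T
    raise RuntimeError("no convergence; check the model parameters")

# Usage: a small three-state chain chosen only for illustration.
A = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]
P, iterations = steady_state(A)
```

Only two vectors (P and T) are kept at any time, which corresponds to the
two large arrays mentioned earlier.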

If the arrival rate $\lambda$ is small enough, the process will converge
eventually and we will obtain the steady-state probabilities. We can then
calculate the average number of jobs in the system (N), and the average
turnaround time (W) by using Little's Theorem. Let $n = i+j+k$; then

$$
N = \sum_{n=0}^{\infty} n \sum_{i+j+k=n} P_{ijk},
\qquad
W = N/\lambda.
$$
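A small sketch of this last calculation, assuming the steady-state
probabilities are stored in a dictionary keyed by the state $(i,j,k)$ (the
function name and the sample distribution are hypothetical):

```python
def mean_jobs_and_turnaround(prob, lam):
    """prob maps (i, j, k) -> steady-state probability; lam is the arrival rate."""
    N = sum((i + j + k) * p for (i, j, k), p in prob.items())
    W = N / lam   # Little's theorem: N = lam * W
    return N, W

# Toy distribution over three states, used only to exercise the function.
prob = {(0, 0, 0): 0.5, (0, 1, 0): 0.25, (0, 1, 1): 0.25}
N, W = mean_jobs_and_turnaround(prob, lam=0.5)
```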

Although the calculation of those conditional probabilities might
sometimes be difficult, the method can indeed be used to evaluate a wide
range of systems. For example, instead of FCFS we can implement
last-come-first-serve (LCFS), look-ahead (LA), or any scheduling algorithm
we can formulate. In the next section, we will introduce a recursive method
given by Chandy, et al. [33], which is very useful on this subject.

One other advantage of this method is that it includes the effects
of finite memory size and the limited number of jobs allowed in the memory.
This is what Avi-Itzhak and Heyman's method cannot supply.

However, one problem still has not been solved, i.e., what the
service rates of the processors and I/O devices are. This is why we did not
discuss the functions $f(j)$ and $g(k)$ contained in equation (4): it is
rather difficult to formulate them to cover the effects of memory allocation
and memory interference, especially in a partial connection architecture.
Again, we reject this method, since the service rates are crucial in
getting correct answers.

2.1.4  Brown,  Browne  and  Chandy's  Method 

Brown,  Browne  and  Chandy's  model  [33]  is  designed  for  the  analysis 
of a computer system with a nonpaged memory executing a multiaccess
interactive workload. The actual system is the dual CDC 6400/6600 computer
complex  at  the  University  of  Texas  at  Austin.  Their  main  purpose  is  to 
study  the  effects  of  the  finite  size  of  executable  memory,  job  workload 
characteristics,  system  overheads,  and  swap  times  on  the  system  performance. 
Therefore,  they  do  share  some  common  interests  with  us.  The  only  difference 
is  that  we  are  also  interested  in  the  effect  of  hardware  architecture.  In 
other  words,  Brown,  Browne  and  Chandy  are  trying  to  improve  a  currently 
existing  system,  but  we  are  trying  to  design  a  better  system  for  the  future. 

Figure 12 shows a network representation of the overall system
to be modeled. When a job is at the user terminal (physically the job is
in the extended core), it experiences a "thinking time" delay, which
corresponds to the time period from the last response until the user submits
the next command. Then the job is queued for swap-in; this is represented
by Swap Delay. When the job is allowed to enter the main memory, it first
needs some amount of time to be swapped from ECS to the main memory, and
then it queues for CPU service. After the CPU processing, the job will
either go to the I/O stage or be ready to be swapped out. If it goes to the
I/O stage, it will rejoin the CPU queue for another service after the I/O
operation. If the job is ready to respond to the user terminal, it will be
swapped out of the main memory to ECS, and the whole cycle begins again.

Figure 12. Overall System to be Modeled. (Components: user terminals with
think time; jobs in ECS with swap delay; swap in; jobs in main memory;
swap out.)

If  the  thinking  time,  the  swapping  delays,  and  the  service  times 
are  all  exponentially  distributed,  the  whole  system  is  a  Poisson  queueing 
network and we can apply Gordon and Newell's result [36]. However, due to
the  finiteness  of  the  main  memory  size,  a  job  might  get  blocked  in  the  swap 
delay  stage  even  though  it  is  at  the  head  of  the  queue.  In  other  words, 
because  of  insufficient  memory  space,  the  job  might  suffer  an  unpredictable 
queueing  delay  before  being  swapped  in.  This  means  the  system  will  not 
behave  exactly  like  a  Poisson  queueing  network  any  more.  Therefore,  some 
other  technique  is  needed  to  account  for  this  effect. 

A  hierarchical  decomposition  technique  [37]  is  then  used  to 
analyze  the  system.  They  decompose  the  system  of  Figure  12  into  two  systems 
shown  in  Figure  13,  and  use  the  following  four  computational  steps  to  get 
the  result. 

Figure 13(a) is called the device model. It is identical to the
overall system of Figure 12, except that the think times are assumed to be
zero. The device model is analyzed assuming there are n jobs in the model,
all of which are assumed to have no problem getting into the main memory.
In other words, it is equivalent to a closed Poisson queueing network with
n jobs in it and no memory constraint. Hence, we can easily determine T(n),
the expected number of jobs swapped out per unit of time given that there
are n jobs in the system, i.e., the throughput out of the swap-out queue.
The analytic solution can be obtained by applying Jackson's or Gordon and
Newell's equation.

Figure 13. Component Models of the Overall System of Figure 12:
(a) Device Model (swap delay, swap in, swap out, throughput T(n));
(b) Overall Model (user terminals with think time, computer system).

In  the  second  step,  we  will  compute  h(n|m)  which  is  assumed  to 
be  the  conditional  probability  that  n  jobs  have  been  allocated  memory 
given  that  there  are  m  ready  jobs.  By  a  ready  job,  we  mean  a  job  is  either 
waiting  to  be  swapped  in,  or  already  in  the  main  memory.  When  a  job  gets 
swapped  out,  it  leaves  the  ready  state  and  enters  the  think  state.  Brown, 
Browne  and  Chandy  propose  the  following  method  to  compute  this  quantity 
recursively. 

Let us assume all the ready jobs form a single first-come-first-serve
queue. A job will join the end of this queue whenever it becomes ready,
and will be removed from the queue whenever it goes back to the think
state. Those jobs already in the main memory tend to cluster in the front
part of the queue. Now, let us go through the ready queue and compute
$g(n,y|\ell)$ for all $n = 0,1,\ldots,m$ and $y = 0,1,\ldots,c$, where c is
the total memory size. $g(n,y|\ell)$ is the conditional probability that n
jobs have been allocated y units of memory, given that the first $\ell$
ready jobs have been considered. Obviously, we have the following simple
facts:

$$
g(n,y|\ell) =
\begin{cases}
1, & \text{for } n=0,\ y=0,\ \ell=0, \\
0, & \text{for } n=0,\ y>0,\ \text{all } \ell, \\
0, & \text{for } n>\ell,\ \text{all } y.
\end{cases}
$$

Then, $g(n,y|\ell)$ can be computed from the following recursive equation:

$$
g(n,y|\ell) = g(n,y|\ell-1)\,[1-P(c-y)]
 + \sum_{x=0}^{y} g(n-1,x|\ell-1)\,p(y-x),
\qquad \ell = 1,2,\ldots,m,
$$

where p(x) is the probability that a job will be of size x, and P(x) is the
cumulative probability, i.e., $P(x) = \sum_{y=0}^{x} p(y)$. The proof of the
above equation is very simple and can be found in [33]. After getting all
these conditional probabilities, we can calculate h(n|m) by summing over all
values of y, i.e.,

$$
h(n|m) = \sum_{y=0}^{c} g(n,y|m).
$$
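The recursion for g and the final summation for h(n|m) can be sketched as
follows. This is an illustrative transcription of the recursive equation
rather than the code of [33]; job sizes are assumed to be integers, `p` is a
list with `p[x]` the probability of size x, and the function name is ours.

```python
def allocation_distribution(p, c, m):
    """Return the list h with h[n] = h(n|m), given memory size c."""
    def pmf(x):
        return p[x] if 0 <= x < len(p) else 0.0

    # Cumulative probabilities P[x] = p(0) + ... + p(x), for x = 0..c.
    P = [sum(pmf(t) for t in range(x + 1)) for x in range(c + 1)]

    # g[n][y] for the jobs considered so far; initially none (l = 0).
    g = [[0.0] * (c + 1) for _ in range(m + 1)]
    g[0][0] = 1.0
    for l in range(1, m + 1):
        new = [[0.0] * (c + 1) for _ in range(m + 1)]
        for n in range(l + 1):
            for y in range(c + 1):
                # The l-th job does not fit into the remaining c - y units.
                stay = g[n][y] * (1.0 - P[c - y])
                # The l-th job has size y - x and is allocated.
                enter = 0.0
                if n >= 1:
                    enter = sum(g[n - 1][x] * pmf(y - x) for x in range(y + 1))
                new[n][y] = stay + enter
        g = new
    return [sum(g[n]) for n in range(m + 1)]

# Usage: jobs of size 1 or 2 (equally likely), 2 units of memory, 2 ready jobs.
h = allocation_distribution([0.0, 0.5, 0.5], c=2, m=2)
```

In this toy case two jobs are resident only when both have size 1, so
h(2|2) = 0.25 and h(1|2) = 0.75.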

In the third step, we consider the queueing network shown in
Figure 13(b), which is called the overall model. This model consists of
just the user terminals and a single queue/server called the computer
system. All the servers in the device model are coalesced into a single
server. The number of jobs in this model is the total number of jobs in the
system, i.e., M. The ready jobs are those in the computer system. Let q(m)
be the expected number of jobs serviced by the computer per unit of time
given that there are m jobs ready. It can be computed by

$$
q(m) = \sum_{n=1}^{m} T(n)\,h(n|m), \qquad 1 \le m \le M.
$$

Therefore, what we have now is the queue-length-dependent service rate
q(m) when there are m jobs in the computer queue. We can then apply the
balance equation technique to solve for the state probabilities, and
calculate the system throughput.
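A sketch of these last two steps is given below. It assumes that T and h are
already available as lists (index 0 unused for T, `h[m][n]` standing for
h(n|m)), that every q(m) is positive, and that each thinking user submits
work at an exponential rate `think_rate`. These names and the birth-death
formulation of the overall model are our assumptions for illustration.

```python
def service_rates(T, h):
    """q[m] = sum over n = 1..m of T[n] * h[m][n]; q[0] is a placeholder."""
    M = len(T) - 1
    return [0.0] + [sum(T[n] * h[m][n] for n in range(1, m + 1))
                    for m in range(1, M + 1)]

def overall_throughput(q, think_rate):
    """Solve the birth-death balance equations of the overall model.

    With m ready jobs, the M - m thinking users submit at total rate
    (M - m) * think_rate and the computer system completes jobs at rate q[m].
    """
    M = len(q) - 1
    weights = [1.0]                  # unnormalized probability of m = 0
    for m in range(1, M + 1):
        weights.append(weights[-1] * (M - m + 1) * think_rate / q[m])
    total = sum(weights)
    pi = [w / total for w in weights]
    return sum(pi[m] * q[m] for m in range(1, M + 1))

# Usage with made-up numbers for a two-user system.
T = [0.0, 1.0, 2.0]
h = [[1.0], [0.0, 1.0], [0.0, 0.5, 0.5]]
q = service_rates(T, h)
throughput = overall_throughput(q, think_rate=1.0)
```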

Basically, this method uses the same technique as the first method
by Avi-Itzhak and Heyman. Both methods first find a state-dependent
service rate and then use the classical balance equation method to solve
the rest of the problem. However, this method is more powerful and accurate
since it considers several things the first method does not include.

Unfortunately, like the other two methods, this method still does
not include the effects of system architecture and resource contention. In
addition, the calculation of $g(n,y|\ell)$ is limited to very simple
scheduling algorithms. Since we are interested in the effects of all these
factors, we have decided not to use an analytic approach. Instead, we will
use simulation, which allows us to consider as many parameters as we want.

In the next section, we will talk about how we do the simulation
and discuss some problems associated with the simulator. We will also give
some definitions of the parameters we are measuring, which will be very
useful in our discussion in the next chapter.

2.2  Simulation 

In  the  last  three  sections,  we  talked  about  some  analytic  tools 
for  studying  computer  performance.  Although  we  devoted  quite  a  number  of 
pages to these methods, what we really wanted to do was to show their
limitations and to explain why we cannot use them for our work.

In  Chapter  1,  we  discussed  several  problems  we  are  interested  in: 
we  want  to  compare  several  memory  allocation  schemes;  we  want  to  study  the 
effect  of  scheduling  algorithms;  we  want  to  compare  partial  connection  with 
full  connection;  we  want  to  see  whether  we  should  use  multiprogramming  or 
monoprogramming;  and  so  on.  All  these  make  our  system  extremely  complicated, 
and  none  of  those  analytic  models  can  cover  all  the  problems.  Hence,  the 
simulation  technique  must  be  used  to  meet  all  our  requirements. 

Although simulation is a very expensive method, it can handle any
arbitrarily complicated model. We can increase or decrease the system
complexity at will. It is this ability to cope with reality that makes
simulation more useful than queueing analysis.

Before  we  describe  our  simulator,  we  would  like  to  discuss  a 
problem,  namely,  the  memory  bandwidth  problem.  By  memory  bandwidth,  we 
mean  the  number  of  words  we  can  access  in  the  main  memory  in  a  unit  of  time. 
Usually,  bandwidth  will  be  measured  in  number  of  words  per  memory  cycle. 
In  other  words,  the  memory  bandwidth  represents  the  information  flow  rate 
in  or  out  of  the  main  memory.  Since  most  of  the  processor  operations  are 
related  to  the  memory,  e.g.,  the  instruction  and  the  operand  fetches, 
memory  bandwidth  significantly  affects  the  system  throughput.  The  higher 
the  bandwidth  is,  the  faster  the  processors  operate. 

Hence,  in  order  to  determine  how  fast  the  system  operates,  we 
must  know  how  much  memory  bandwidth  we  can  get  from  the  system.  As  we  will 
see  later,  this  is  what  we  use  to  advance  the  virtual  clock  in  our  simulator. 

In the next section, we will first derive a simple bandwidth
equation. Then, we will show a general equation we use in the simulator. This
general  equation  can  be  used  to  handle  different  kinds  of  memory  allocation 
and  different  types  of  system  architecture.  We  can  also  put  in  some  parameters, 
e.g.,  the  memory-processor  speed  ratio,  in  order  to  take  care  of  things  that 
might  exist  in  a  real  computer  system. 

2.2.1  Memory  Bandwidth  Problem 

We just defined the memory bandwidth to be the average number of
words we can access during one memory cycle. From the memory's point of view,
it is the average number of busy memory modules per unit of time. Of course,
this quantity depends on a lot of factors, for example, how many references
a processor will make in one memory cycle; the inter-relationship or
pattern of these references; how many processors are accessing the memory;
how they interface with each other; etc. In general, this is not a simple
problem to solve. However, if we can make some reasonable assumptions, then
we might be able to arrive at some closed-form solutions.

Let us start with a simple problem. Assuming there are m identical
memory modules operating in parallel, and a processor is generating s randomly
distributed references to these memory modules per memory cycle, what is
the memory bandwidth of this system? For example, the processor may be s
times faster than the memory and access s items at random addresses. Or,
equivalently, there may be s independent processors (usually we use p, so
s=p in this case), each making only one memory reference per memory
cycle. This is an interesting combinatorial problem whose answer is given
by the following theorem [38].

Theorem: Given m identical memory modules operating in
parallel, if we generate s randomly distributed references,
then the average number of busy memory modules (bandwidth)
will be:

$$
B_w = \sum_{k=1}^{t} \frac{k \cdot k!\,S(s,k)\binom{m}{k}}{m^{s}},
$$

where $t=\min(m,s)$ and $S(s,k)$ is the Stirling number of
the second kind.

We prove in [39] that this equation can be reduced to a very simple closed
form, that is:

$$
B_w = m\left[1 - (1 - 1/m)^{s}\right]. \tag{5}
$$
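The agreement between the Stirling-number sum and the closed form (5) can be
checked exactly for small m and s. The sketch below uses rational arithmetic
so that no rounding is involved; the function names are ours, and the sum is
coded as we read it from the theorem, i.e., the expected number of distinct
modules among s uniformly random references.

```python
from fractions import Fraction
from math import comb, factorial

def stirling2(s, k):
    """Stirling number of the second kind, S(s, k), by the usual recurrence."""
    row = [1] + [0] * k              # S(0, 0..k)
    for _ in range(s):
        new = [0] * (k + 1)
        for j in range(1, k + 1):
            new[j] = j * row[j] + row[j - 1]
        row = new
    return row[k]

def bandwidth_stirling(m, s):
    """Sum over k of k * k! * S(s,k) * C(m,k) / m^s, for k = 1..min(m,s)."""
    t = min(m, s)
    return sum(Fraction(k * factorial(k) * stirling2(s, k) * comb(m, k), m ** s)
               for k in range(1, t + 1))

def bandwidth_closed(m, s):
    """Equation (5): m * [1 - (1 - 1/m)^s]."""
    return m * (1 - Fraction(m - 1, m) ** s)

# Usage: both forms give 3/2 for two modules and two references.
value = bandwidth_stirling(2, 2)
```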


If we are interested in asymptotic behavior, then the above result can be
transformed into the following linear form as we keep the ratio of s and m
fixed and let m and s go to infinity:

$$
B_w = m\left[1 - 1/e^{r}\right], \qquad \text{where } r = s/m.
$$
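The limit can be verified in one line, using only the standard limit
$(1-1/m)^{m} \to e^{-1}$ and the substitution $s = rm$ with r held fixed
(a short derivation added here for completeness):

```latex
\frac{B_w}{m}
  = 1 - \left(1 - \frac{1}{m}\right)^{rm}
  = 1 - \left[\left(1 - \frac{1}{m}\right)^{m}\right]^{r}
  \;\longrightarrow\; 1 - e^{-r}
  \qquad (m \to \infty,\ r = s/m \text{ fixed}).
```

For fixed r the factor $1 - e^{-r}$ is a constant, so the bandwidth grows
linearly in m.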

What  this  result  implies  is  that,  when  we  double  the  number  of 
processors  and  memory  modules,  the  memory  bandwidth  we  will  get  is  also 
doubled,  provided  each  processor  is  generating  one  independent  reference 
per  memory  cycle.  This  contradicts  what  people  have  always  called  the 
biggest  disadvantage  of  a  multiprocessor,  namely,  doubling  the  cost  will  not 
double  the  performance.  The  most  famous  result  is,  of  course,  the  square 
root equation proposed by Hellerman [40], which says that the memory
bandwidth of an interleaved memory system only grows as the square root of
the number of memory modules. So, when we double the modules, the memory
bandwidth only increases by roughly 40%. Apparently, this result is too
conservative.

In fact, the result in equation (5) can be obtained in another way.
Let us look at one specific memory module. Since each processor (or reference)
will access this module with probability 1/m, the probability that it will
not be accessed by a given processor is 1 - 1/m. Since all processors are
independent, the probability that none of them will reference this module
is $(1 - 1/m)^{s}$. Therefore, the probability that at least one processor
will reference this module is $1 - (1 - 1/m)^{s}$, which is the probability
that this memory module will be busy. Summing over all the memory modules,
we get equation (5) as the average number of busy modules, or the memory
bandwidth. This method is very useful, and later we use it to get our
general bandwidth equation.
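The probabilistic argument is easy to check by a small Monte Carlo
experiment. The sketch below counts the distinct modules hit by s uniform
random references and compares the average with equation (5). The function
names, the seed and the number of trials are arbitrary choices made for
illustration.

```python
import random

def simulated_bandwidth(m, s, trials=20000, seed=1):
    """Average number of distinct modules hit by s uniform random references."""
    rng = random.Random(seed)
    busy = 0
    for _ in range(trials):
        busy += len({rng.randrange(m) for _ in range(s)})
    return busy / trials

def closed_form_bandwidth(m, s):
    """Equation (5): m * [1 - (1 - 1/m)^s]."""
    return m * (1.0 - (1.0 - 1.0 / m) ** s)

# Usage: eight processors, eight modules, one reference each per cycle.
estimate = simulated_bandwidth(8, 8)
exact = closed_form_bandwidth(8, 8)
```

The two values should agree to within the sampling error of the experiment.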

Now, let us go back to the first problem. It is generally
acknowledged that references generated by a processor should not be
considered as randomly distributed. Instead, there should be some kind of
relationship between two successive references. Hence, Burnett and
Coffman [41] introduced the effects of seriality. They assume that the
probability of the next reference addressing the next module in sequence
(modulo m) is $\alpha$, and the probability of addressing any other given
module is $\beta$, where $\beta = (1-\alpha)/(m-1)$. Formally, let $r_i$ be
the module number of the $i$-th reference; then

$$
\Pr\{r_{i+1} = (r_i+1) \bmod m\} = \alpha,
$$

$$
\Pr\{r_{i+1} = x\} = \beta \quad \text{for each } x \ne (r_i+1) \bmod m,
\qquad i = 1,2,\ldots,s-1,
$$

where s is the number of references generated per memory cycle. Then, the
memory bandwidth is given by the following theorem [30,42].

Theorem: Assume the processor generates s memory
references per memory cycle. If the next reference
in line will address the next module in sequence
(modulo m) with probability $\alpha$ and any other module
out of sequence with probability $\beta = (1-\alpha)/(m-1)$,
then the memory bandwidth will be:

$$
B_w = \sum_{k=1}^{t} \sum_{j=0}^{k-1} \alpha^{j} \beta^{k-1-j}\, C_m(j,k),
$$

where $t = \min(m,s)$, and

$$
C_m(j,k) = \binom{k-1}{j} \sum_{n=0}^{k-j-1} (-1)^{n} \binom{k-j-1}{n}
           (m-j-n-1)_{k-j-n-1},
$$

with $(x)_n = x(x-1)\cdots(x-n+1)$ denoting the falling factorial.

If we plot $B_w$ against $\alpha$, we can see the bandwidth grows
exponentially as $\alpha$ increases. This implies that, if a program has a
high seriality $\alpha$ and if addresses are distributed across the memory
modules, then we can get very high bandwidth out of the memory.
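The double sum of the theorem is easy to evaluate by machine. The sketch
below assumes the theorem reads
$B_w = \sum_{k=1}^{t}\sum_{j=0}^{k-1} \alpha^{j}\beta^{k-1-j} C_m(j,k)$ with
$C_m(j,k) = \binom{k-1}{j}\sum_{n=0}^{k-j-1}(-1)^{n}\binom{k-j-1}{n}(m-j-n-1)_{k-j-n-1}$,
where $(x)_n$ is the falling factorial; this is our reading of the printed
formula, so the code should be taken as an illustration rather than as the
authors' program. It requires $m \ge 2$, and the function names are ours.

```python
from fractions import Fraction
from math import comb, perm

def c_m(m, j, k):
    """C_m(j, k); perm(a, b) is the falling factorial a(a-1)...(a-b+1)."""
    return comb(k - 1, j) * sum(
        (-1) ** n * comb(k - j - 1, n) * perm(m - j - n - 1, k - j - n - 1)
        for n in range(k - j))

def serial_bandwidth(m, s, alpha):
    """Evaluate the double sum exactly, using rational arithmetic."""
    alpha = Fraction(alpha)
    beta = (1 - alpha) / (m - 1)
    t = min(m, s)
    return sum(alpha ** j * beta ** (k - 1 - j) * c_m(m, j, k)
               for k in range(1, t + 1) for j in range(k))

# Usage: four modules, four references per cycle.
fully_serial = serial_bandwidth(4, 4, 1)             # alpha = 1
no_preference = serial_bandwidth(4, 4, Fraction(1, 4))  # alpha = beta = 1/m
```

With $\alpha = 1$ every reference goes to the next module in sequence and
the sum evaluates to $t$; with $\alpha = \beta = 1/m$ it evaluates to 71/32
for the sample values.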

However, if there is more than one processor, the problem becomes
very complicated, since we should also consider the interference between
processors. The solution for one processor is already very messy, and it
would be even more difficult to use the same kind of approach. Therefore,
we need a new technique for finding the memory bandwidth.

Recall the probability approach we just described to derive
equation (5). If we can figure out the probability that a certain module
will be busy, then we can sum all these quantities together and obtain the
total memory bandwidth. This is a very basic and useful way of getting the
memory bandwidth. Using this idea, we can write down the following general
bandwidth equation:

$$
B_w = \sum_{j=1}^{m} \Big[1 - \prod_{i=1}^{p} (1 - p_{ij})\Big], \tag{6}
$$

where m and p are the numbers of memory modules and processors, and
$p_{ij}$ is the probability that the $i$-th processor will reference the
$j$-th module. Of course, the only problem is to find out all the
$p_{ij}$'s. For example, if all $p_{ij} = 1/m$, then equation (6) reduces
to equation (5) with s = p.
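Equation (6) translates directly into code. In the sketch below, `p[i][j]`
holds the probability that processor i references module j; the function
name and the two sample matrices are illustrative assumptions.

```python
from math import prod

def general_bandwidth(p):
    """Equation (6): sum over modules of P(at least one processor hits it)."""
    n_proc = len(p)
    n_mod = len(p[0])
    return sum(1.0 - prod(1.0 - p[i][j] for i in range(n_proc))
               for j in range(n_mod))

# Usage 1: three processors referencing four modules uniformly at random.
uniform = [[0.25] * 4 for _ in range(3)]
bw_uniform = general_bandwidth(uniform)

# Usage 2: two processors, each tied to its own private module.
private = [[1.0, 0.0],
           [0.0, 1.0]]
bw_private = general_bandwidth(private)
```

The first case matches equation (5) with m = 4 and s = 3; in the second case
there is no interference at all and both modules are always busy.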

Let $q_{ij} = 1 - p_{ij}$ be the probability that the $i$-th processor will
not reference the $j$-th module. The above equation can be rewritten as:

$$
B_w = m - \sum_{j=1}^{m} \prod_{i=1}^{p} q_{ij}.
$$

Sometimes it is easier to get $q_{ij}$ than $p_{ij}$.

Now we can solve the multiprocessor bandwidth equation problem
mentioned above. We state the problem and the solution in the following
theorem. The only thing we have to do is to figure out the $p_{ij}$'s and
substitute them into equation (6).

Theorem: Assume we have p processors referencing m
memory modules. Each processor generates s references
per memory cycle and they have the seriality relationship
stated in the previous theorem. Let $p_{ik}^{(\ell)}$ be the probability
that the $i$-th processor will generate a $k$ in the $\ell$-th
position and no $j$ occurs before, i.e., the probability of the
event shown in Figure 14; then the bandwidth equation for
s > 1 will be:

$$
B_w = m - \sum_{j=1}^{m} \prod_{i=1}^{p} q_{ij}^{(s)},
$$

where

D(s-1)  D(s-1) 

q(s)      .     (1    _      (j-1),  „    .    [PilizILa  +  (1    -  ^'Zll-)  iz£]} 

"j  "j  l-p^         l-p^  ^ 

The proof of this theorem and the calculation of the $p^{(s)}$ probabilities
are given in Appendix A. Another derivation by using Markov chains is also
given there.

This theorem shows the usefulness of equation (6). We can use
it for deriving the memory bandwidth of a great variety of systems. We
will also use equation (6) in our simulator. However, the real meaning of
$\alpha$ is very vague. Every program has a different value of $\alpha$.
Unless we do a thorough analysis of program traces, we are in no position
to say what the $\alpha$ value for a program should be. We cannot afford to
do this in our study. Besides, when s is small, say 2 or 3, a moderate
change of $\alpha$ does not make much difference in the result. Therefore,
we prefer not to use the above recursive solution. Instead, we just assume
no seriality between references, i.e.,

$$
q_{ij}^{(s)} = (1 - p_{ij})^{s}.
$$

A good discussion of memory bandwidth can be found in [43] or [39].


2.2.2  The  Simulator 

Our simulator uses the so-called event-driven technique, that is,
the whole simulation process is driven by a sequence of event times. An
event time is the time at which an event occurs, which could be the arrival
of a job, the completion of a processing period or I/O operation, the
departure of a job, etc. The virtual clock is advanced from the current time
to the next event time. Every time we advance the clock to a new event
time, we calculate all the statistics we want between this new event
time and the previous event time, and update the system status. Then, we
use this new status information to figure out the time at which the next
event will occur. This process keeps going until a certain number of
jobs has been simulated. In a simulation where the timing is the most
important statistic, the event-driven technique is the most convenient and
useful tool.
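The event-driven loop described above can be sketched as follows. This is a
generic illustration, not the structure of our simulator: the event kinds,
the `handle` callback and the priority queue from Python's `heapq` module
are assumptions made to keep the example short.

```python
import heapq

def run(events, handle, max_events=1000):
    """Minimal event-driven loop.

    events: list of (time, kind) tuples; handle(clock, kind) updates the
    system status and returns a list of newly scheduled (time, kind) events.
    Returns the final virtual clock and a log of the processed events.
    """
    heapq.heapify(events)
    clock, log = 0.0, []
    while events and len(log) < max_events:
        time, kind = heapq.heappop(events)   # next event time
        elapsed = time - clock               # interval for the statistics
        clock = time                         # advance the virtual clock
        log.append((clock, kind, elapsed))
        for new_event in handle(clock, kind):
            heapq.heappush(events, new_event)
    return clock, log

# Usage: each arrival schedules its own departure 2.0 time units later.
def handle(clock, kind):
    return [(clock + 2.0, "departure")] if kind == "arrival" else []

clock, log = run([(0.0, "arrival"), (1.0, "arrival")], handle)
```

Statistics are accumulated over the interval `elapsed` between two
consecutive events, which is the only time during which the system status is
constant.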

Figure 15 shows the overall structure of the simulator, and Figure
16 is the flow chart of the main program. Obviously, the most important
part is how to generate the next event time (the dotted box in Figure 16).

Figure 15. Overall Structure of Our Simulator.

Figure 16. Flowchart of the Simulator.

The input to the simulator is a sequence of jobs. Each job is
a four-tuple that consists of the arrival time of this job, as well as
the CPU time required (in milliseconds), the number of I/O requests, and
the memory space (in K bytes) required by this job. Usually, people call
this kind of information a "workload." We use two kinds of workload in
our analysis, namely, the real workload and the artificial workload. The
real workload is obtained from the System Management File (SMF) in our
IBM 360/75 system. The SMF routines store on tape a complete record of
the processing information of all the jobs run on the 360/75 every day. We
extract the above information from the SMF tapes to constitute the
input workload of our simulator. On the other hand, we can use random
number generators to generate an artificial job stream. The means and
variances of the job parameters can be obtained by analyzing the real data
we got from the SMF tapes, and these can be used by the random number
generators to produce artificial jobs. The real workload reflects what
really happened in a computer system, but the artificial workload is easier
to modify. We will use both and compare their results.

When a job "arrives," that is, when the content of the virtual
clock is equal to or greater than the job arrival time, it will be placed
into the waiting queue according to some scheduling algorithm. The
scheduling algorithm will greatly influence the system performance,
especially the average turnaround time. In the next chapter, we will compare
some non-preemptive scheduling algorithms.

The jobs in the waiting queue are then considered for entering
the system according to their ordering. If the memory has enough room for
the job under consideration, this job will join the system and start getting
service. Of course, in a monoprogramming system this job should also have
a free processor assigned to it. In the job scanning, we usually allow a
certain distance of look-ahead. This means that, if a job cannot be selected
for service, we are allowed to go down the line and look at the next job.
This scheme might improve the performance, particularly for a short
look-ahead distance. For a long look-ahead distance, however, we might get
a negative result, since allowing a large look-ahead tends to reduce the
effect of the scheduling algorithm. Our result in the next chapter indeed
shows this phenomenon.

One  negative  side  of  the  look-ahead  scheme  is  that  a  job  requiring 
large  memory  space  might  get  blocked  all  the  time  since  smaller  jobs  might 
sneak  in  and  never  leave  enough  space  for  this  big  job.  In  order  to  avoid 
this  problem,  we  adopt  the  following  strategy.  When  a  job  first  joins  the 
waiting  queue,  we  will  attach  a  count  to  this  job.  Later  on,  we  will  reduce 
this count by one every time we scan this job. When the count reaches zero
and the job has still not been served, we stop looking ahead, which forces the
system to accept this job eventually. In the 360/75 system, people use a similar
method  to  accomplish  this,  namely,  by  gradually  reducing  the  magic  number 
attached  to  a  job.  Of  course,  the  360/75  also  uses  a  more  complicated  method, 
viz.  adjusting  the  job  initiators  to  give  large  jobs  more  chance  to  get  into 
the  memory. 
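
The look-ahead scan with its count can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulator's actual code: the names Job, select_job, size_kb, and count, and the convention of returning a queue index, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    size_kb: int
    count: int  # scans left before look-ahead past this job is suspended


def select_job(queue, free_kb, look_ahead):
    """Return the index of the job to admit, or None if none can be admitted.

    The scan starts at the head of the queue and may look at most
    `look_ahead` jobs beyond it. Every job that is scanned but does not
    fit has its count reduced by one; once a count reaches zero the scan
    stops there, so no later job can overtake that job again.
    """
    for pos, job in enumerate(queue[: look_ahead + 1]):
        if job.size_kb <= free_kb:
            return pos
        job.count = max(job.count - 1, 0)
        if job.count == 0:
            return None  # stop looking ahead; wait until this job fits
    return None
```

With a queue holding one 600K job (count 2) followed by small jobs and only 200K free, the first scan admits a small job, the second scan drives the count to zero and admits nothing, and the large job enters as soon as enough memory is released.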

When  a  job  gets  into  the  memory,  we  first  partition  its  CPU  time 
into  as  many  pieces  as  the  number  of  I/O  requests  required  by  this  job  in 
the following way. Let us assume the job requires I I/O requests. We
generate I exponentially distributed random numbers in (0,1), sum them together,
divide the total CPU time by this sum to get a proportionality constant, and
then multiply each original random number by this constant to obtain the
length of each piece. It is easy to show that the sum of these normalized
pieces is equal to the total CPU time.
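
The partitioning step can be sketched as below. This is only an illustration: the function name partition_cpu_time, the use of unit-mean exponential variates, and the handling of a job with no I/O requests are assumptions, not details taken from the simulator.

```python
import random


def partition_cpu_time(total_cpu, n_io, rng=None):
    """Split total_cpu into n_io pieces with exponentially distributed lengths.

    Draw n_io exponential variates, then rescale them by the constant
    total_cpu / sum(draws) so that the pieces add up to total_cpu.
    """
    if n_io <= 0:
        return [total_cpu]  # assumption: no I/O means a single CPU burst
    rng = rng or random.Random()
    draws = [rng.expovariate(1.0) for _ in range(n_io)]
    k = total_cpu / sum(draws)  # the proportionality constant
    return [k * d for d in draws]


pieces = partition_cpu_time(22.68, 757, random.Random(1))
print(len(pieces), sum(pieces))
```

Because every piece is the same constant times its draw, the sum of the pieces is the constant times the sum of the draws, which is the total CPU time up to rounding error.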

The  reason  we  are   doing  this  is  that  we  do  not  know  the  length 
of  the  time  period  between  two  I/O  requests.  Although  the  SMF  tapes  do 
contain  this  information,  it  is  difficult  and  time-consuming  to  get  it 
from  the  tapes. 

During  the  simulation  process,  a  job  will  go  to  the  I/O  stage 
after  one  segment  of  CPU  time  has  been  served.   It  will  then  do  some  I/O 
and  return  to  process  the  next  segment.  Each  I/O  operation  will  be  assumed 
to  be  nominally  42  ms  long,  which  is  of  the  same  order  as  a  disk  operation 
on  a  2314  disk  unit.  Of  course,  this  parameter  can  be  changed. 

As  we  said  before,  the  virtual  clock  of  the  simulator  advances 
from  an  event  time  to  the  next  event  time.  The  next  event  time  depends  on 
how fast the system operates, and this in turn depends on the memory
bandwidth. The faster we can get data out of the memory, the faster the
system can operate.

The  memory  bandwidth  is  a  function  of  several  variables,  for 
example,  the  processor-memory  speed  ratio,  the  number  of  active  processors, 
the  number  of  memory  modules,  and  the  memory  allocation  scheme.   In  the 
last section, we showed a general equation which will be used in our
simulator. This equation is used in the dotted box of Figure 16. However,
we have not yet discussed one quantity, namely p_ij in equation (6). p_ij
depends on what percentage of the program resides in this module and how we
allocate memory to the program. In our simulator, we will assume the program
is interleaved horizontally, i.e., across the memory modules. So, it is fair
to assume p_ij is the fraction of the program residing in a certain module.
In other words, if a program is distributed in four memory modules, then
all four p_ij's will be assumed to be 1/4. Of course, some people might
argue about this, but we think it is a reasonable assumption.

At any time instant, how fast the processors can run is determined
by the "instantaneous" bandwidth. By instantaneous bandwidth, we
mean the memory bandwidth of the current state, i.e., before the state change.
This  is  because  the  memory  bandwidth  is  state-dependent,  i.e.,  depends  on 
how  many  processors  are  running. 

Once  the  instantaneous  bandwidth  has  been  figured  out,  we  will 
distribute  it  to  all  the  "active"  processors.  The  share  each  processor 
will  get  is  proportional  to  the  contribution  it  makes  toward  the  total 
bandwidth.  This  partial  bandwidth  can  be  viewed  as  the  processing  power 
the  processor  uses  to  execute  a  job.  We  can  then  figure  out  the  time  that 
the  next  event  will  occur  and  the  amount  of  work  each  processor  has  done 
between  two  event  times.  In  general,  the  instantaneous  bandwidths  of  two 
intervals  will  be  different,  because  the  number  of  active  processors  might 
be  different. 

Now,  let  us  give  a  simple  example  to  show  how  to  calculate  the 
memory bandwidth and compute the next event time. Assume we have a
distributed system with eight memory modules (m=8) and a speed ratio of two (s=2).
Also, assume we have three jobs in execution that require 10.5, 13.0, and
20.8 milliseconds of CPU time until their next I/O operations, and the current
time T equals 2145.32 ms (all these figures are arbitrarily chosen). Since
we are using the distributed memory allocation scheme, all p_ij's will be
equal to 1/8.

First,  we  have  to  figure  out  the  total  instantaneous  bandwidth. 
This can be obtained by using the following formula:

    B_w = 8 - \sum_{j=1}^{8} \prod_{i=1}^{3} q_{ij}^{s},

where q_ij = 1 - p_ij. Since q_ij = 7/8 and s = 2, we get
B_w = 8 - 8(7/8)^6 = 4.41. If we assume
another  memory  allocation  scheme,  e.g.,  the  partitioned  or  mixed  scheme, 
the  calculation  of  the  total  instantaneous  bandwidth  is  very  similar, 
although  it  will  be  a  little  bit  more  complicated.  In  the  next  chapter, 
we  will  show  the  calculation  for  a  partitioned  system.  Since  we  are  using 
the  distributed  scheme,  the  total  bandwidth  will  be  equally  distributed 
to  all  the  jobs.  Therefore,  each  processor  will  get  a  bandwidth  of  1.47 
to  execute  the  job. 
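
The figures 4.41 and 1.47 can be checked with a short sketch. It assumes that the speed ratio s enters as an exponent on each q_ij = 1 - p_ij, which is the reading that reproduces the value quoted in the text. The function name instantaneous_bandwidth and the row-per-job layout of the p matrix are hypothetical.

```python
def instantaneous_bandwidth(p, s):
    """Bandwidth m - sum_j prod_i (1 - p[i][j])**s.

    p[i][j] is the fraction of job i's program residing in module j.
    The exponent s is an assumption chosen to match the worked example.
    """
    m = len(p[0])
    idle = 0.0
    for j in range(m):
        prod = 1.0
        for row in p:
            prod *= (1.0 - row[j]) ** s
        idle += prod  # probability-like weight that module j is not referenced
    return m - idle


# Three jobs spread evenly over eight modules, speed ratio two.
p = [[1 / 8] * 8 for _ in range(3)]
total = instantaneous_bandwidth(p, 2)
share = total / len(p)  # distributed scheme: equal share per processor
print(round(total, 2), round(share, 2))
```

For this example the total evaluates to 8 - 8(7/8)^6, about 4.41, and each of the three processors receives about 1.47.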

Now, we can compute the next event time by using all the information
provided above. Clearly, the job with 10.5 milliseconds of work will
be done and go to the I/O stage first. In other words, the next event time
is the time instant this job stops processing and issues an I/O request.
The other two jobs will still be in execution on their processors by then,
since all the jobs get the same amount of bandwidth and progress at
the same rate. So, the problem is to find out how long it will take to
finish 10.5 milliseconds of work, given that the processor has a memory
bandwidth of 1.47. We can solve this by manipulating the dimensions.

Let us assume w to be the average number of memory references a
processor will issue every millisecond and w' to be the number of memory
cycles per millisecond. The dimension of memory bandwidth is the number of
memory references (mr) per memory cycle (mc). So, the time to finish 10.5 ms
of work is:

    \Delta T = \frac{10.5 \text{ ms} \times w \text{ mr/ms}}{1.47 \text{ mr/mc} \times w' \text{ mc/ms}} = 7.14 \, \frac{w}{w'} \text{ ms}.

If we assume a processor issues a memory reference every processor cycle,
then w/w' is equal to the memory-processor speed ratio s. (In fact, we define
processor cycle time to be the average period of time a processor will take
to issue a memory reference.) Hence, ΔT is equal to 7.14 x s = 14.28 ms.
The next event time will be T + ΔT = 2145.32 + 14.28 = 2159.6 ms. In other
words, the first job will stop processing and issue an I/O request at
time instant 2159.6. Since we assume each I/O operation will take a constant
amount of time, we know the time at which this job will finish the I/O
operation if it gets the I/O device it wants.
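
The next-event computation of this example can be written as a small sketch. The function name next_event_time and its argument names are hypothetical. The relation used is the one derived above: elapsed time equals the remaining work divided by the processor's bandwidth share, multiplied by the speed ratio s.

```python
def next_event_time(now_ms, remaining_ms, share, s):
    """Return (event_time, delta_t) for the job that finishes its CPU burst first.

    remaining_ms: CPU work left for each running job until its next I/O
    share:        bandwidth each processor receives (equal shares assumed)
    s:            memory-processor speed ratio
    """
    shortest = min(remaining_ms)
    delta_t = shortest / share * s
    return now_ms + delta_t, delta_t


event_time, delta_t = next_event_time(2145.32, [10.5, 13.0, 20.8], 1.47, 2)
print(round(delta_t, 2), round(event_time, 1))
```

Without the intermediate rounding of 7.14, the interval is about 14.29 ms, which gives the same next event time of 2159.6 ms to one decimal place.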

Of course, we must assume that nothing else will happen between
the current time T and T + ΔT. For example, if a fourth job finishes an
I/O transaction and resumes processing between these two time instants, the
next event time will be the time instant at which this fourth job resumes
processing. Hence, instead of using ΔT, we have to figure out how much work
will have been done between the current time and the time instant the above
event occurs. We then have to subtract this from the processing time of each
job and repeat the whole procedure with four jobs.
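
When an earlier event interrupts the interval, the work completed so far follows from inverting the same relation. The sketch below is illustrative only: the function names work_done and advance_jobs, and the interruption 7.0 ms after T, are assumptions made for the example.

```python
def work_done(elapsed_ms, share, s):
    """CPU work (in ms) a processor completes in elapsed_ms of simulated time."""
    return elapsed_ms * share / s


def advance_jobs(remaining_ms, elapsed_ms, share, s):
    """Subtract the work done during elapsed_ms from every running job."""
    done = work_done(elapsed_ms, share, s)
    return [max(r - done, 0.0) for r in remaining_ms]


# Hypothetical case: a fourth job resumes processing 7.0 ms after T.
remaining = advance_jobs([10.5, 13.0, 20.8], 7.0, 1.47, 2)
print([round(r, 3) for r in remaining])
```

After this step the bandwidth would be recomputed for four active processors before the next event time is determined again.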

We  can  see  that  the  principle  behind  our  simulator  is  quite  simple. 
However,  it  is  a  very   useful  tool,  and  we  use  it  to  generate  all  our  results. 
In  the  next  chapter,  we  will  show  some  interesting  results  we  have  obtained. 
Before  we  do  that,  we  will  define  some  of  the  measurements  which  will  be 
used  in  the  later  discussion. 

2.2.3  Definitions  of  System  Measurements

In order to give a more succinct presentation of the simulation
results, we will use the following definitions very often in the next
chapter. As always, p, m, and r denote the numbers of processors, memory
modules,  and  I/O  devices  in  the  system.  The  memory-processor  speed  ratio, 
i.e.,  the  ratio  of  memory  cycle  time  and  processor  cycle  time,  is  denoted 
by  s. 

Ta  will  be  used  to  denote  the  average  turnaround  time,  i.e.,  the 
average  amount  of  time  a  job  will  spend  in  the  system.  Ta  can  be  broken 
into  two  parts,  namely,  the  average  queueing  time  q  and  the  average  service 
time  e.  In  other  words,  Ta  =  q  +  e.  q  is  the  average  amount  of  time  a  job 
has  to  spend  in  the  outside  waiting  queue,  or  the  time  period  from  the  moment 
the job arrives to the moment it enters the memory. In a multiprogrammed
system, q is usually caused by insufficient memory space. In a monoprogrammed
system, this could also be caused by lack of a free processor. Sometimes,
we might put a superscript on q to denote the queueing time which occurs
somewhere else. For example, q^IO represents the queueing time which occurs in
the I/O queue. On the other hand, e denotes the average amount of time a
job will spend in the memory, which can further be broken into processing
time, I/O time, and some possible delays due to queueing for resources,
e.g., q^IO. We gave a graphical representation of these parameters earlier
in Figure 7.

n  will  denote  the  average  number  of  jobs  in  the  whole  system. 
Again,  a  superscript  will  be  used  to  denote  which  part  of  the  system  we  are 
talking  about.  For  example,  n^  represents  the  average  number  of  jobs  in 
the  outside  waiting  queue. 

B_w is used to represent the memory bandwidth. A superscript will
be used to indicate the memory allocation scheme we use. So, B_w^d, B_w^m,
and B_w^p will denote the memory bandwidths for distributed, mixed, and
partitioned systems respectively.

U_m and U_p will denote the utilizations of memory and processors.
U_m is the average fraction of the memory which will be occupied by jobs.
We will explain later that there are two kinds of memory utilization in a
partitioned system due to an unusual way of allocating memory. U_p, on the
other hand, is the fraction of time a processor is busy executing a job.


Chapter  3
EXPERIMENTAL  RESULTS 


3.1  Results  for  Software  Related  Questions 

We  roughly  described  some  interesting  design  problems  in  Section 
1.4.  In  this  chapter,  we  are  going  to  present  a  lot  of  simulation  results 
to  answer  these  problems.  Each  problem  is  affected  by  a  number  of  variables, 
and  we  will  include  the  effects  of  as  many  of  these  variables  as  possible. 
Basically,  we  will  follow  the  same  order  as  that  in  Chapter  1.  Therefore, 
we  will  start  with  software  related  questions. 

Before  we  proceed,  a  word  of  caution  is  in  order.   In  sections 
3.1.2  and  3.1.3  which  deal  with  monoprogramming  versus  multiprogramming  and 
the memory allocation schemes, the reader will quickly come to the
conclusion that multiprogramming and the distributed memory allocation scheme
are superior in terms of performance. This is in fact true. However, the
results  in  these  two  sections  assume  the  existence  of  complete  processor  to 
memory  connection,  i.e.,  any  processor  can  access  any  memory  module 
subject  only  to  possible  momentary  delays  due  to  memory  conflicts.  As  is 
well  known,  such  a  connection  network  gets  very   expensive  as  the  system 
grows  and  is  difficult  to  expand.  The  effectiveness  of  multiprogramming 
and  the  distributed  scheme  depends  to  a  large  extent  on  this  complete  but 
expensive  connection  capability. 

Thus, we are interested primarily in the degree of degradation
due to monoprogramming, whole module memory allocation, the poorer
bandwidth resulting from partitioned and mixed schemes, and various factors
which affect this degradation.

In  later  sections  of  this  chapter  we  will  present  similar  results 
using partial connection networks. We will see the advantages of
multiprogramming and the distributed scheme diminish considerably in these cases.

Before  we  talk  about  the  first  software  question,  we  should  first 
describe  some  properties  of  the  workload  used  in  the  simulation,  since  the 
workload  will  greatly  influence  system  performance.  Although  we  are  going 
to  discuss  how  a  change  of  the  workload  will  affect  the  result,  we  feel  that 
a  description  is  needed  here  in  order  to  give  the  reader  a  better  understanding 
of  the  whole  discussion. 

3.1.1  The  Workload 

As  we  said  in  Chapter  2,  our  workload  is  a  sequence  of  four-tuples. 
Each  four-tuple  consists  of  four  pieces  of  information  about  a  job:  the 
arrival  time,  the  CPU  time,  the  memory  requirement,  and  the  number  of  I/O 
requests.  The  original  source  of  this  information  is  the  IBM  SMF  tape. 
Table 3 displays some statistical data on these parameters, which were
obtained by analyzing 1300 real jobs which were run on the University's IBM
360/75 system. Note that we show the mean and the standard deviation of the
job  interarrival  time  instead  of  the  job  arrival  time.  This  is  because 
the  job  arrival  time  is  an  absolute  measurement  which  does  not  show  the 
distance  between  two  arrivals  unless  we  lay  out  all  the  arrival  times.  On 
the  other  hand,  the  job  interarrival  time  is  a  relative  measurement  which 
can  give  us  some  idea  how  fast  the  jobs  arrive. 

The  data  in  Table  3  are  obtained  directly  from  the  SMF  tape. 
Sometimes  we  will  scale  this  data  in  order  to  properly  load  our  system. 
For  example,  if  we  want  to  see  the  effect  of  doubling  the  system  load,  we 
can achieve this by reducing the arrival times by one half, thus making
it appear as though the jobs are arriving twice as fast. Scaling will be
used when we are studying the effect of various workloads.

Data                 Mean     σ        Unit

Interarrival Time    6.87     6.96     Sec.
CPU Time             22.68    22.40    Sec.
Job Size             117      80       K bytes
I/O Requests         757      739      No./Job

Table 3. Some Statistical Data of the Workload
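
The load scaling described above amounts to multiplying every arrival time by a constant. A minimal sketch follows. The function names scale_arrivals and interarrival_times and the sample arrival times are hypothetical.

```python
def scale_arrivals(arrival_times, factor):
    """Multiply every arrival time by factor; 0.5 makes jobs arrive twice as fast."""
    return [t * factor for t in arrival_times]


def interarrival_times(arrival_times):
    """Gaps between consecutive arrivals."""
    return [b - a for a, b in zip(arrival_times, arrival_times[1:])]


arrivals = [0.0, 6.0, 14.0, 20.0]  # illustrative values, in seconds
print(interarrival_times(arrivals))
print(interarrival_times(scale_arrivals(arrivals, 0.5)))
```

Halving the arrival times halves every interarrival gap, so the mean interarrival time is halved while the other three job parameters are left unchanged.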

The  real  workload,  of  course,  reflects  what  really  happens  in  a 
computer  system.  However,  it  is  very   difficult  to  modify  or  enlarge.  For 
example,  if  we  want  a  job  stream  which  is  twice  as  long  as  what  we  have  now, 
then  we  will  have  to  get  the  second  half  from  the  SMF  tape  and  append  it  to 
the  first  half.  If  we  are  unlucky,  the  second  half  might  have  completely 
different characteristics from the first half. For example, if the next day
is the due date of a CS101 machine problem, the number of small jobs
submitted to the system is suddenly doubled or tripled, which greatly perturbs
the characteristics of the workload. This is very undesirable in
simulation work. In addition, it is very difficult to modify some job
parameters,  e.g.,  the  standard  deviations  and  the  distributions.  Most  of 
all,  it  only  represents  the  workload  on  our  360/75  system,  and  we  would  like 
to  see  the  result  of  a  more  general  job  stream.  Therefore,  we  will  use 
another  method  we  mentioned  in  Chapter  2,  i.e.,  to  produce  an  artificial 
workload  by  using  random  number  generators. 

To generate an artificial workload, we have to know the
distributions as well as the means and the standard deviations of all four parameters.
Of  course,  we  can  arbitrarily  make  up  this  information.  However,  in  order 
to  maintain  some  reality,  we  will  obtain  them  by  analyzing  the  real  work- 
load. 

Our analysis shows that the distributions of the interarrival
time, the CPU time, and the number of I/O requests are approximately
exponential, with the means and standard deviations shown in Table 3.
Therefore, we can easily reproduce them by using the following equation:

    y = m \log_e \frac{1}{1-x}

where  m  is  the  mean,  x  is  a  uniformly  distributed  random  number  in  (0,1), 
and  the  resulting  y  is  an  exponentially  distributed  random  number.  The 
proof  can  be  found  in  [44]. 
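
The transformation above is a direct application of the inverse-function method and can be sketched in a few lines. The function names exponential and sample_exponential are hypothetical. The mean of 6.87 seconds is the interarrival mean quoted for the real workload.

```python
import math
import random


def exponential(mean, x):
    """Map a uniform number x in [0, 1) to an exponential variate with the given mean."""
    return mean * math.log(1.0 / (1.0 - x))


def sample_exponential(mean, rng):
    """Draw one exponentially distributed number using rng.random() as x."""
    return exponential(mean, rng.random())


rng = random.Random(0)
samples = [sample_exponential(6.87, rng) for _ in range(20000)]
print(sum(samples) / len(samples))  # sample mean, expected to be near 6.87
```

Since rng.random() never returns 1.0, the argument of the logarithm is always finite.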

However, the distribution of the job size is not so simple.
Figure 17 shows the density curve of the job size. It contains two bumps,
one at around 20K bytes and the other at around 120K bytes. Of course,
this depends heavily on the system. The analysis by Chandy, et al. [33]
also shows the same phenomenon. It is very difficult to write down an
equation and compute the inverse function, as we did above. We have to do this
numerically, i.e., get the cumulative probability function F(y), generate
a uniformly distributed random number x in (0,1), and then compute the
inverse function

    y = F^{-1}(x)

In fact, this is the basic method of generating a generally
distributed random number. However, it is very time-consuming since it
involves a searching procedure to determine the interval x lies in, and
perhaps an interpolation if we want a more accurate value. But, this is
all we can do to handle a general distribution. We will use this method
for generating the job size.
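
The numerical inversion can be sketched as a table lookup followed by linear interpolation. This is an illustration under assumptions, not the simulator's code: the function name make_inverse_cdf, the use of a binary search for the searching procedure, and the small histogram at the end are all hypothetical and do not reproduce the measured job-size data.

```python
import bisect


def make_inverse_cdf(edges, counts):
    """Build y = F^{-1}(x) from a histogram.

    edges:  bin boundaries, len(counts) + 1 increasing values
    counts: number of observations in each bin
    """
    total = float(sum(counts))
    cdf = [0.0]
    for c in counts:
        cdf.append(cdf[-1] + c / total)
    cdf[-1] = 1.0  # guard against accumulated rounding error

    def inverse(x):
        # Search for the interval with cdf[k] <= x < cdf[k + 1].
        k = bisect.bisect_right(cdf, x) - 1
        k = min(max(k, 0), len(counts) - 1)
        span = cdf[k + 1] - cdf[k]
        frac = (x - cdf[k]) / span if span > 0 else 0.0
        # Interpolate linearly inside the bin.
        return edges[k] + frac * (edges[k + 1] - edges[k])

    return inverse


inverse = make_inverse_cdf([0, 20, 40, 60], [1, 2, 1])  # illustrative bins only
print(inverse(0.5), inverse(0.875))
```

Feeding uniformly distributed numbers in (0,1) to the returned function yields values whose histogram follows the given bin counts.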

After knowing these distributions and methods, we can generate
four sequences of random numbers to form the artificial workload. This
method is very flexible since we can produce a workload with any
characteristics we want. Most of the simulations will use the artificial workload.


[Figure 17: histogram, statistics of 1300 jobs. Horizontal axis: job size
(K bytes), 20 to 300. Vertical axis: number of jobs.]

Figure 17. The Density of the Job Size.

Of course, one disadvantage of this method is that these parameters are now
completely independent of each other. In the real job stream, there might
be some correlation among them; for example, a job requiring very large
space might use long CPU time and do a large amount of I/O operations.
So, there is some difference between the artificial workload and the real
workload.

These  large  jobs  will,  of  course,  seriously  degrade  the  average 
turnaround  time.  If  one  of  these  jobs  gets  into  the  memory,  it  will  occupy 
a  large  portion  of  the  memory  for  a  long  time.  This  will  then  block  the 
jobs  in  the  waiting  queue  from  being  executed  until  it  finally  gets  done 
and  releases  the  memory.  Hence,  all  of  the  waiting  jobs  suffer  a  very   long 
delay  which  causes  a  significant  increase  of  the  total  average  turnaround 
time. 

Figure 18 shows some simulation results using both real and artificial
workloads, where we use a system with 1024K bytes of main memory, sixteen
memory modules, constant I/O time of 42 ms, memory-processor speed
ratio of 4, monoprogramming, shortest-job-first scheduling algorithm, the
distributed scheme of memory allocation, a full connection switching
network, and 800 jobs. All these parameters were explained in the last chapter
and are indicated in the figure.

As  we  can  see  from  this  figure,  there  is  a  significant  difference 
between  the  two  curves.  After  some  analysis,  we  find  that  this  difference 
is indeed caused by a few very large jobs in the real job stream. Each
of these jobs claims a large amount of space and requires a large CPU time,
so they contribute a significant amount of queueing delay to the final
average turnaround time. However, this does not happen in the artificial
workload.

Figure 18. A Comparison of the Turnaround Times of
the Real and Artificial Workloads.

If we delete these big jobs and run the simulation again, we get
a result which is just a little bit higher than the result using the artificial
workload.  Therefore,  we  believe  that  the  artificial  workload  is  a  pretty 
good  approximation  of  the  real  workload  if  we  ignore  a  few  big  jobs  which 
occasionally  occur  in  the  job  stream. 

However,  we  are  not  saying  that  we  will  ignore  the  existence  of 
these  big  jobs.  Actually,  this  kind  of  job  always  exists  in  a  typical 
university  workload.  Most  of  them  are  so-called  number-crunching  jobs 
since they require a large number of floating-point operations. A
number-crunching job needs a processor with a fast floating-point arithmetic
unit for fast, efficient execution. Most of the minicomputers and
microprocessors, however, do not provide floating-point hardware. The
floating-point operations are done by software or microprogrammed subroutines.
So, the number-crunching jobs are not suitable for these small machines.
Although a few minicomputers which came out recently do have floating-point
hardware, e.g., the PDP 11/70, they still cannot provide fast and efficient
execution for this type of job.

The  best  way  to  handle  these  big  jobs  is  to  use  a  big  machine  like 
the CDC 7600 or Amdahl 470. These machines all have fast pipelined
floating-point arithmetic units, which can execute a number-crunching job
very quickly and efficiently. This is why these big machines are important
in the computation world.

Although a minicomputer or a microprocessor is not appropriate
for handling these big jobs, we should not consider this a fatal
disadvantage of building multiprocessor systems from minicomputers or
microprocessors. We can easily solve, or rather "avoid," this
problem by the following method. Whenever a big job arrives, we can send
it via a computer network to a big machine elsewhere, which can give better
service to this job. This is why we use the artificial workload, since it
approximates the real workload without those big jobs.

In fact, it does not matter which workload we use in our
simulation, since we are doing comparison work or finding the effect of a certain
parameter.  However,  the  artificial  workload  seems  to  be  more  convenient 
for  us  to  use,  and  so  we  will  use  it  in  the  following  discussion.  The  real 
job  stream  will  only  be  used  to  provide  the  necessary  information,  e.g., 
mean,  variance,  and  distribution,  in  generating  the  artificial  job  stream. 

One thing we would like to point out here is that the absolute value of
a certain measurement, for example, a 150 sec. average turnaround time, in
general does not have much meaning alone. Only the relative magnitude
or the percentage difference can indicate the goodness of one system
over another. We will try to use percentages in our presentation.

3.1.2  Monoprogramming  versus  Multiprogramming 

Our first problem is to compare monoprogramming and
multiprogramming. As we said in Chapter 1, we are more interested in monoprocessing
systems, i.e., each job can only be run on one processor, as in the PRIME
system. Hence, we are not talking about ILLIAC-IV type machines. Using
the definition of Flynn [45], we are dealing with MIMD type machines,
not SIMD type machines.

Again,  we  want  to  remind  the  reader  that  all  the  results  in 
Section  3.1  are  using  a  full  connection. 

By monoprogramming, as was defined in Chapter 1, we mean each
processor is dedicated to a job once it is assigned to this job, and can
execute only this job until it is finished. When this job is doing I/O,
the processor remains idle. In other words, no overlapping of processing
and I/O will take place. Due to this rule, we allow at most as many
jobs in the system as there are processors.

On the other hand, in a multiprogramming system, we will pack as
many jobs as possible in the memory and let these jobs share all the
processors. Once a processor becomes free, it will try to obtain a job from
the processor queue and execute it. No processor will be left idle
intentionally if there is a job ready to be executed.

Of course, multiprogramming can result in higher processor
utilization and memory utilization, which means higher system throughput. This
in general implies we can get shorter average job turnaround time (Ta).
Figure 19 shows Ta versus m curves for both monoprogramming and
multiprogramming. For p=4, the gap between the two curves is very big:
monoprogramming is about 60% (182 vs 114) higher than multiprogramming.
However, when p=6, the gap closes to about 13% (102 vs 90).

This is not surprising, since when we increase the number of
processors we also increase the maximum number of jobs allowed in the memory
in monoprogramming. Apparently, for six processors and a total memory
size of 1024K bytes, monoprogramming is already competitive with
multiprogramming.

Figure 20 shows how the monoprogramming curve approaches the
multiprogramming curve when we increase the number of processors. As we
can see, for small p, multiprogramming does show a superiority over
monoprogramming. But with a moderate number of processors, e.g., 6 in this
case, the two results are already more or less the same. The reason is rather
simple: for the job size distribution we use, the main memory can
only contain about six jobs most of the time. Hence, there is really
no reason we should use multiprogramming if we can afford "enough"
processors. Of course, "enough" is determined by the total main memory size and
the distribution of the job size.

Figure 19. Comparison of Monoprogramming and Multiprogramming.

Figure 20. The Effect of Increasing the Number of Processors.

Besides, we have not taken the software overhead of multiprogramming
into account. By overhead here, we mean the extra work the
multiprogrammed operating system must do, e.g., updating the outgoing user's
file, restoring the incoming user's status information, etc. We do not know
exactly how high this overhead will be. However, in most computer systems
a large portion of CPU time is spent in the operating system. In other
words, the processors will have to do more work in a multiprogrammed
system than in a monoprogrammed system. If we assume this overhead
increases the job CPU time by 10%, then the Ta curve for multiprogramming
will move up to the dotted curve shown in Figure 20. Now, monoprogramming
wins for p > 6.

The  effect  of  this  10%  overhead  is  different  for  each  p.  It 
causes  a  larger  increment  of  Ta  for  smaller  p.  For  larger  p,  say  8,  the 
increment  is  about  10%,  and  for  smaller  p,  say  four  or  less,  the  increment 
is more than 20%. Apparently, the overhead is very important when the
number  of  processors  is  small. 

This phenomenon can be explained by Table 4, where for each p
value we show the degree of multiprogramming and the queueing delay due to
no available processor. The degree of multiprogramming is defined to be
the average number of jobs each processor will have to take care of at the
same time. In other words, on the average there will be p * (degree of
multiprogramming) jobs in the memory which are sharing p processors. From
Table 4, we can see the degree of multiprogramming is high when p is small.
This means a lot of jobs will have to compete for a few processors, and
the queueing delay caused by waiting for a free processor will be very high.
In the last two columns, we show the queueing delay and its percentage
of the total service time. We can see the queueing delay of a two-
processor system occupies almost 60% of the total service time and is more
than ten times that of an eight-processor system. If we add an overhead
to each job, the queueing delay will grow as a function of the degree of
multiprogramming, since each job will cause a certain amount of delay (the
increment of the CPU time) to every job waiting for a processor. This means
that the overhead will result in a longer delay for smaller p. Therefore,
the average turnaround time will increase more for smaller p. We will
further explain this in Section 3.1.5.

  p    Degree of           Queueing Delay Due    Queueing Delay/Average
       Multiprogramming    to No Processor       Service Time (%)

  2         4.22                  85                   57.7
  3         2.61                  41                   39.0
  4         1.71                  20                   23.3
  5         1.35                  13                   17.3
  6         1.17                  10                   15.4
  7         1.08                   9                   14.3
  8         1.03                   8                   11.2

Table 4. Degree of Multiprogramming and Queueing
Delay Due to No Available Processor.
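
The relation between p, the degree of multiprogramming, and the number of jobs in memory can be checked directly against Table 4. The following Python sketch is only an illustration: the dictionary and function names are assumptions introduced here, and the numbers are copied from Table 4.

```python
# Table 4 rows: p -> (degree of multiprogramming, queueing delay due to no processor)
TABLE_4 = {
    2: (4.22, 85),
    3: (2.61, 41),
    4: (1.71, 20),
    5: (1.35, 13),
    6: (1.17, 10),
    7: (1.08, 9),
    8: (1.03, 8),
}


def jobs_in_memory(p):
    """Average number of jobs sharing p processors: p * (degree of multiprogramming)."""
    degree, _delay = TABLE_4[p]
    return p * degree


def delay_ratio(p_small, p_large):
    """Ratio of the queueing delays reported for two processor counts."""
    return TABLE_4[p_small][1] / TABLE_4[p_large][1]


if __name__ == "__main__":
    for p in sorted(TABLE_4):
        print(p, round(jobs_in_memory(p), 2))
    # Delay of the two-processor system relative to the eight-processor system.
    print(round(delay_ratio(2, 8), 1))
```

The last line gives the factor behind the statement that the two-processor queueing delay is more than ten times that of the eight-processor system.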

Before  we  add  the  overhead,  we  can  see  that  multiprogramming 
wins  by  a  wide  margin  when  p  is  small.  However,  the  difference  will  be 
reduced  significantly  if  we  just  add  a  moderate  amount  of  overhead.  The 
increment of the average turnaround time is very sensitive to the overhead,
especially  for  small  p.  Therefore,  the  superiority  of  multiprogramming 
in  that  region  will  disappear  rather  quickly  if  the  overhead  goes  up.  If 
we  insist  on  using  multiprogramming,  the  software  design  will  be  extremely 
important.  A  bad  design  can  easily  degrade  the  performance  seriously. 

From Figure 20, we can see another interesting point about these
curves. If we do not consider the overhead, the multiprogramming curve
is only two processors to the left of the monoprogramming curve. If we
use two more processors in the monoprogrammed system, the monoprogramming
curve will be shifted to the left by two processors. Thus, the two curves will
almost overlap. In fact, in this case, the monoprogramming curve will be
slightly under the multiprogramming curve. This means that if the cost of
two processors is less than the difference in the software costs, then we
should use monoprogramming with two more processors. Although we do not
have exact figures for these costs and so cannot draw a firm conclusion,
this apparently is the case under the current trend, since software costs
are rising rapidly and hardware costs are falling significantly every year.

If we do take the 10% overhead into account, we can see the dif-
ference is only about one processor. If the overhead is even higher,
the monoprogramming curve might become completely superior to the
multiprogramming curve.

However, we are not completely against multiprogramming. In some
cases, multiprogramming is still a better solution for system design. For
example, if we have a job mix with all small and I/O-bound jobs, then
multiprogramming might give us better results. By I/O-bound, we mean a
job which spends most of its lifetime doing I/O and a relatively small
amount of time in the processor. In other words, an I/O-bound job will have an
I/O time which is, say, several times longer than its CPU time. Most
COBOL programs, for example, are I/O-bound under this definition.

The job mix we are using, however, is not I/O-bound. This can
be seen from the mean values we show in Table 3. If we assume an I/O
operation takes 42 ms, then on the average, a job will spend about 30
seconds doing I/O, which is of the same order as the average CPU time.
Of course, the real average CPU time will not be the same as that shown
in Table 3, since it will be affected by several factors, e.g., the memory
allocation scheme (to be discussed in the next section), the degree of
interleaving, the memory and processor speeds, etc. However, our simulation
results show that it ranges from 30 to 65 seconds. So, the CPU time has been
increased by a factor of 1.35 to 2.9. This can be explained as follows.
We assume a processor cycle (pc) to be the average amount of time between
two successive memory references issued by a processor. In other words,
a processor will generate one memory reference in one processor cycle.
Since s is defined to be the ratio between the memory speed and the
processor speed, a memory cycle (mc) will be s * pc. Let us also assume that
one CPU second is equal to w processor cycles, or w memory references. So,
a program needs 10w memory references if its CPU time requirement is 10
seconds. If the average memory bandwidth a processor can get is b, it will
take w/b memory cycles to satisfy one CPU second of work. The average memory
bandwidth b in general will be less than s due to memory interference.
Since pc = 1/w second by definition, w/b memory cycles is equivalent to
(w/b) * s * (1/w) = s/b seconds. Since b is bounded by s, s/b will always
be greater than 1. That is, it always takes more than one second of time
to complete one CPU second of work. This explains why the average processing
time ranges from 30 to 65 seconds, instead of being the 22.68 seconds shown
in Table 3. Therefore, the job mix we are using is not I/O-bound, since on
the average a job will spend roughly the same or more time in execution
than doing I/O.
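
The s/b argument above can be traced step by step in a few lines of Python. This is a minimal sketch under stated assumptions: the function names are introduced here for illustration, the default value of w is arbitrary (it cancels out), and the bandwidth b = 2 used in the example is a hypothetical value, not a simulation result.

```python
def seconds_per_cpu_second(s, b, w=1_000_000):
    """Elapsed seconds needed to complete one CPU second of work.

    s: memory-processor speed ratio, so one memory cycle mc = s * pc
    b: average memory bandwidth a processor obtains (references per memory cycle)
    w: processor cycles (memory references) per CPU second; cancels out
    """
    if not 0 < b <= s:
        raise ValueError("expected 0 < b <= s")
    pc = 1.0 / w           # processor cycle, in seconds
    mc = s * pc            # memory cycle, in seconds
    memory_cycles = w / b  # memory cycles needed to serve w references
    return memory_cycles * mc  # equals s / b


def elapsed_cpu_time(cpu_seconds, s, b):
    """Elapsed processing time for a job that needs cpu_seconds of CPU work."""
    return cpu_seconds * seconds_per_cpu_second(s, b)


if __name__ == "__main__":
    # Hypothetical case: s = 4 and an obtained bandwidth of b = 2.
    print(seconds_per_cpu_second(4, 2))
    print(elapsed_cpu_time(22.68, 4, 2))
```

With these hypothetical inputs the mean CPU time of 22.68 seconds from Table 3 stretches by a factor of 2, which falls inside the 30 to 65 second range reported above.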

If we use a job mix with all small and I/O-bound jobs, the re-
sults will be quite different from those of Figure 20. Table 5 shows the
results for a new job mix, where we increase the number of I/O requests
of each job by 50% and reduce the CPU time and the job size by 25% each.

                                    Ta

  p    Multiprogramming    Multiprogramming         Monoprogramming
                           (With 10% Overhead)

  4          90                   94                     671
  5          86                   90                     204
  6          83                   87                     132
  7          83                   88                     107
  8          84                   88                      90

Table 5. The Results by Using New Job Mix.

The average I/O time is now about twice as large as the average CPU time
(46 to 24). Therefore, the new job mix is I/O-bound with smaller job
sizes. We only show the results for p=4 to 8. As we can see, the gap
between multiprogramming and monoprogramming is rather large even for p as
large as 7. If we add in 10% overhead as we did earlier, multiprogramming
(middle column of Table 5) still wins by a slight margin for p=8.
Apparently, multiprogramming will yield better results for a job mix which
contains small and I/O-bound jobs.
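
The size of the gap in Table 5 can be made concrete with a short calculation. The Python sketch below is illustrative only: the names are assumptions introduced here, and the turnaround times are copied from Table 5.

```python
# Table 5 rows: p -> (Ta multiprogramming, Ta multiprogramming with 10% overhead,
#                     Ta monoprogramming)
TABLE_5 = {
    4: (90, 94, 671),
    5: (86, 90, 204),
    6: (83, 87, 132),
    7: (83, 88, 107),
    8: (84, 88, 90),
}


def mono_penalty(p, with_overhead=False):
    """Fraction by which the monoprogramming Ta exceeds the multiprogramming Ta."""
    multi, multi_overhead, mono = TABLE_5[p]
    base = multi_overhead if with_overhead else multi
    return (mono - base) / base


if __name__ == "__main__":
    for p in sorted(TABLE_5):
        print(p, round(mono_penalty(p), 3), round(mono_penalty(p, with_overhead=True), 3))
```

A positive value means multiprogramming has the lower turnaround time; for this job mix the value stays positive for every p in the table, with or without the 10% overhead.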

Therefore,  which  strategy  we  should  use  depends  on  the  job  mix 
we  are  dealing  with.  Our  conclusions  in  this  report  are  all  based  on  the 
job  mix  we  described  in  the  early  part  of  this  chapter,  which  is  a  typical 
workload  on  a  university  batch  system. 

Table 6 shows the memory utilization (U_m) and processor utilization (U_p)
of both monoprogramming and multiprogramming for the system described in
Figure 20. The memory utilization of monoprogramming goes up as p increases,
and the memory utilization of multiprogramming essentially remains the same.
This is what we would expect. The processor utilizations, on the other
hand, both decline as p increases. This is because U_p is the utilization
of one processor; in other words, it is the "normalized" utilization. If
we multiply U_p by p, we can see the results are increasing. This shows the
same trend as the Ta curve in Figure 22. Therefore, we can think of U_p * p
as a representation of the work being done by the system in a certain time
unit.

In fact, the processor utilization is strongly related to the
system performance. If U_p * p is higher, the processors can finish more
work in one unit of time, and the average turnaround time will be lower.

            Monoprogramming             Multiprogramming

  p      U_m     U_p     U_p*p       U_m     U_p     U_p*p

  4      .44     .50     2.00        .69     .65     2.60
  6      .59     .44     2.64        .64     .45     2.70
  8      .61     .34     2.72        .62     .32     2.72

Table 6. Hardware Utilizations for Monoprogramming and
Multiprogramming.

Therefore, Ta can indirectly tell us what the relative processor utilization
should be. In the following discussion, we will not show the utilizations
except on a few occasions, since they will behave like those in Table 6.

Another measurement that can also indicate the work being done
by the system is the total memory bandwidth B_w. The total memory bandwidth
is the memory bandwidth generated by all the active processors, i.e.,
processors that are executing jobs. Like U_p * p, the higher the total memory
bandwidth is, the faster the processors will operate. Figure 21 shows B_w
versus p curves for the system of Figure 20. As we can see, the total memory
bandwidths of the monoprogrammed and multiprogrammed systems both go up as
we increase the number of processors. The multiprogrammed system has a
higher total memory bandwidth. However, when p is large, say 8, the total
memory bandwidths of both systems are very close to each other. Recall from
Figure 20 that the average turnaround times of both systems also become very
close to each other as p gets large.

Figure 21. The Total Memory Bandwidth for the System of Figure 20.

3.1.3  Memory  Allocation  Schemes

As we described in Figure 5 of Chapter 1, we are interested in
three kinds of memory allocation scheme: the partitioned scheme (Figure 5-a),
the distributed scheme (Figure 5-b), and the mixed scheme (Figure 5-c).
We briefly explained there how they work and their advantages and
disadvantages (Table 2). In this section, we are going to investigate their
performance and look at some problems related to these schemes.

As we shall see, the memory allocation scheme affects performance
in three different ways. First, the space efficiency of an allocation will
affect the number of jobs which can be in the memory at any time. The
partitioned scheme tends to waste memory since memory can only be allocated
in whole modules. The other two allocation schemes do not waste any memory
in this way. Hence, the partitioned scheme has less potential to pack jobs
in the memory than the other two schemes. Second, the allocation scheme
affects the memory bandwidth available to any given job. If, for example,
a job requires a small amount of memory (less than one module), then
under the partitioned scheme or the mixed scheme only one module will be
allocated to this job and the memory bandwidth is limited to 1. On the
other hand, distributed allocation causes the job to be spread across all
memory modules, thus allowing a higher potential bandwidth (although this
bandwidth is subject to interference from other jobs in the memory).
Suppose, however, that the job requires four memory modules. It may well get
worse bandwidth using the distributed scheme than using the partitioned
scheme, since the latter case is not subject to interference from the other
jobs. Finally, the third effect of allocation on performance has to do
with the classical problem of address interleaving, which affects the ability
of a job to utilize the potential memory bandwidth. This question has
been discussed extensively in the literature [39,42,46,47,48]. We see no
way of providing definitive answers in this area short of using
actual address streams. But this would lead to results of questionable
generality and would be prohibitively expensive. We will, however, establish
some bounds on the possible effects of good versus bad address interleaving.

The  factors  above  interact  in  complex  and  frequently  unpredicta- 
ble ways.  We  will  attempt  to  isolate  the  effects  of  each  factor  as  much 
as  possible.  We  begin  by  analyzing  the  effect  of  memory  waste  on  overall 
performance. 

Figure 22 shows the curves of the average turnaround time versus
the number of memory modules for all three schemes, and Figure 23 shows
their corresponding total memory bandwidth curves. Notice that the total
amount of memory does not change; m is simply the number of modules into
which this total is divided. The solid lines represent the monoprogramming
curves and the dotted lines represent the multiprogramming curves. Only
five curves are shown in both figures; the two curves for the distributed
scheme are very close to each other and only one is shown.

As we can see, m has a great influence on both partitioned and
mixed schemes, especially from 8 to 16. For m=8, the turnaround time of
the partitioned scheme is almost ten times as large as that of the
distributed scheme. This is easy to understand: with so few memory
modules, a large portion of the memory can easily be wasted, and not
many jobs can be in the memory at the same time. For example, a job
requiring 130K bytes of memory will occupy two modules since the module size
is 128K bytes, and almost one-eighth of the useful memory has been wasted
by this job. Therefore, a job in general has to spend a lot of time in
the waiting queue before it finally has enough memory modules to enter
the memory.

Table 7 shows the total memory bandwidth (B_w), the average job
bandwidth (b_w), the memory utilization (U_m), the average number of jobs
in memory (n), and the queueing time of each job (q) for these three
allocation schemes. The average job bandwidth is the average memory bandwidth
each job can get while it is being executed. It depends on how we allocate
the memory, how we interleave the program, and the speed ratio s of memory
and processor. We described how we calculate the memory bandwidth
in the last chapter. Since a processor can only generate up to s memory
requests per memory cycle, b_w is upper bounded by s. From Table 7, we
can see the highest job bandwidth we obtain is 3.18, where we use the
distributed scheme, 32 memory modules, and a speed ratio of 4.

Figure 22. Average Turnaround Times of Three Memory Allocation Schemes.

Figure 23. Total Memory Bandwidths of Three Memory Allocation Schemes
(for the System in Figure 22).

                          0.58
                          0.62
                          41      25
                          5.57    6.50    6.54
                          1.10    1.53
                          1.98
                 U_m      0.79    0.68    0.63
                 n
                          445     41      26
                          6.50    6.64    6.78

                          m=8     m=16    m=24    m=32

  Distributed    b_w      1.46    2.57    2.81    3.18
                 U_m      0.80    0.62    0.60    0.57
                 n        7.2     5.5     5.3     5.1
                 q        58      15      12      10

Table 7. The Total Memory Bandwidth (B_w), Average Job Bandwidth
(b_w), Memory Utilization (U_m), Average Number of Jobs
in Memory (n), and Average Queueing Time (q) of Three
Memory Allocation Schemes (for the Monoprogrammed
System of Figure 22).

The job bandwidths for the partitioned scheme and the mixed scheme
are very similar. Basically, this is because the ways these two schemes
allocate modules to a job are quite similar, as we can see in Figure 5. A
job will be stored in the same number of modules under both schemes, although
there might be some module sharing in the mixed system. Thus, a job is
confined to a fixed number of modules and can only access these modules no
matter which scheme we use. Most importantly, most or all of the job is free
from interference by other jobs. There is no memory interference between
jobs in the partitioned system, and only a little interference in the mixed
system, since only the "overflow" parts will share a module. Therefore, the
job bandwidths for these two schemes will be quite close.
However, the mixed scheme yields better turnaround times, since it can use
the memory more efficiently and hence allow more jobs to be processed at the
same time. (This tells us that the job bandwidth alone cannot be used to
compare the performance of two systems.)

The job bandwidth of the distributed scheme, on the other hand,
is much better than the job bandwidths of the other two schemes. The
distributed scheme will spread out every job across the whole memory, thus
providing each processor the potential of referencing every memory module.
Although all the jobs are sharing the memory and the mutual interference
is large, a large bandwidth still can be obtained through the large degree
of interleaving. Later in this section, we will explain why the distributed
system can produce a higher bandwidth than the mixed system by using a
numerical example.

The memory utilization in Table 7 is defined to be the per-
centage of the memory that is actually used by jobs. For the mixed scheme
and the distributed scheme, there will be no problem since a job will be
allocated exactly the amount of memory it asks for. As for the par-
titioned scheme, it is a little more complicated since the memory is
allocated by the module, and in general a job will get more memory than
it really needs. So, there are two different memory utilizations we should
distinguish. One is the utilization we defined above, which is
the percentage of the memory really used by the jobs. The other one is
the percentage of the memory that is occupied by the jobs. Of course, the
latter is larger than the former since some memory will be occupied but
not used. In other words, some memory is wasted under the partitioned
scheme.

In order to distinguish these two types of memory utilization,
we will call the first one "word memory utilization" and the second
one "module memory utilization." Of course, both types of memory utiliza-
tion will be the same in a mixed system or a distributed system, i.e., the
word memory utilization. The utilizations we show in Table 7 are all word
memory utilizations. Most of the time we will just call this memory utiliza-
tion for short.

One interesting thing is that the difference between these two memory
utilizations is the percentage of the memory which has been wasted, i.e.,
occupied but unused. This is very easy to understand. We will show some
results of the memory waste of the partitioned scheme later.

As we can see from Table 7, for m=8 and under monoprogramming,
the partitioned scheme has a total memory bandwidth of 4.47, a job bandwidth
of 1.15, a (word) memory utilization of only 58%, and averages 5.5 jobs in
the memory, which results in an average queueing time of 1343 seconds!
Under the same condition, the distributed system has a total bandwidth of
6.50, a job bandwidth of 1.46, a memory utilization of 80%, averages 7.2
jobs, and has an average queueing time of only 58 seconds.

Of course, one way to improve the performance of the partitioned
system is to increase the number of memory modules and to decrease the size
of each module. This can reduce the amount of wasted memory, since on
the average each job will waste one-half of a module (see the proof in
Chapter 2). Thus, the probability that a job gets blocked due to insufficient
memory will be reduced. In Table 8, we show the word utilization, the
module utilization, and the memory waste of the partitioned system in
Figure 22. As we can see, when we double the number of modules from 8 to
16, the word memory utilization of the partitioned system increases to 67%,
and the module memory utilization drops a little to 92%. Meanwhile,
the memory waste has been reduced from 37% to 25%. This is why the average
queueing time drops sharply to 112 seconds (an improvement by a factor of
12)! If we further increase the number of modules to 24, the memory waste
will decrease to 14%, and the average queueing time drops to 41 seconds.
Apparently, the partitioned system is very sensitive to the number of
modules. The main reason is, of course, that memory waste reduces the
ability to accept jobs. Therefore, it is very important to provide enough
memory modules in the partitioned system.
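
The statement that memory waste is the difference between module and word memory utilization can be checked against the values reported in Table 8. The Python sketch below is illustrative: the names are assumptions introduced here, and the figures are copied from Table 8 and from the queueing times quoted in the text.

```python
# Table 8: m -> (word memory utilization, module memory utilization, reported waste)
TABLE_8 = {
    8: (0.58, 0.95, 0.37),
    16: (0.67, 0.92, 0.25),
    24: (0.62, 0.76, 0.14),
    32: (0.60, 0.71, 0.11),
}


def memory_waste(word_util, module_util):
    """Fraction of memory that is occupied but unused."""
    return module_util - word_util


def queueing_gain(q_before, q_after):
    """Factor by which the average queueing time improves."""
    return q_before / q_after


if __name__ == "__main__":
    for m, (word, module, reported) in sorted(TABLE_8.items()):
        print(m, round(memory_waste(word, module), 2), reported)
    # Queueing time of the partitioned system for m=8 versus m=16 (seconds).
    print(round(queueing_gain(1343, 112), 1))
```

Each computed difference matches the waste row of Table 8, and the ratio of the two queueing times reproduces the improvement by a factor of about 12.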


                                 m=8      m=16     m=24     m=32

  Word Memory Utilization        0.58     0.67     0.62     0.60
  Module Memory Utilization      0.95     0.92     0.76     0.71
  Memory Waste                   0.37     0.25     0.14     0.11

Table 8. Memory Waste for the Partitioned Scheme

Actually, the memory utilization and the average number of jobs
in memory do not grow as the system performance improves. On the contrary,
all the memory utilizations and the average numbers of jobs decrease when
we increase m, except for the case we just mentioned. And surprisingly,
the distributed scheme has the smallest values of U_m and n among these
schemes, yet it has the best turnaround time. This
means that a higher utilization (as we define it) does not necessarily
imply a better throughput.

In fact, U_m and n should decrease as the system throughput in-
creases, since if the arrival rate is fixed, the faster the system operates,
the faster the jobs leave, and the emptier the system will be.* Especially
in a distributed system, the fewer jobs in the memory, the less memory con-
tention each job will suffer, and the higher bandwidth each processor will
get to execute a job. The system throughput (the memory bandwidth) goes
up when we increase the number of memory modules. This explains why U_m
and n decrease as m increases.

The distributed scheme, however, does not have this memory utili-
zation advantage over the mixed scheme, since both the distributed and mixed
schemes can fully utilize the memory. This was shown in Figure 5. In the
first column of Table 7, the mixed scheme indeed shows a memory utilization
(79%) and an average number of jobs in memory (7.5) comparable to those of
the distributed scheme. Despite this, the distributed scheme still yields
a better turnaround time for every m value. Apparently, the distributed
scheme can produce higher bandwidth than the mixed scheme can, if they are
given the same set of jobs in the memory. The difference must come from
the degree of interleaving a scheme provides to each job, since this is
the only difference between these two schemes.


*This can be explained by using Little's Theorem, i.e., n = λx, where λ
is the job arrival rate and x is the average turnaround time.

Let us look at an example which can explain why the distributed
scheme generates higher memory bandwidth than the mixed scheme does.
Assume we have a memory system of 8 modules and four jobs of sizes 1 2/3
modules, 2 1/2 modules, 1 1/3 modules, and 2 modules respectively. These
jobs are stored in the memory as shown in Figure 24. The numbers shown
in Figure 24-a are the fractions of these jobs in each individual module.
We assume they are the reference probabilities we need in the general band-
width equation, i.e., Equation (6) in the last chapter. Of course, for
the distributed system shown in Figure 24-b, all the reference probabilities
are 1/8. Let us also assume that the references generated by a processor
are all independent, as we do in our simulation. In order to mimic the
real operation, we will in addition assume job a is doing I/O and hence
does not contribute to the total bandwidth.

Now, for the distributed system, the bandwidth can be calculated
as follows.

\[
\begin{aligned}
B_w^{D} &= \sum_{j=1}^{m} \Bigl[ 1 - \prod_{i=1}^{p} q_{ij}^{(s)} \Bigr]
         = m \bigl[ 1 - (1 - 1/m)^{ps} \bigr] \\
        &= 8 \bigl[ 1 - (1 - 1/8)^{3 \cdot 4} \bigr] \\
        &= 8 \bigl[ 1 - (7/8)^{12} \bigr] \\
        &= 6.389
\end{aligned}
\]

where s=4 is the memory-processor speed ratio which we assumed in Figure 22,
and p=3 since only three of the four jobs are active.
As we can see, s contributes a lot in the above calculation since it
increases the power of the second term in the brackets. As for the
mixed system, the bandwidth can be calculated by first finding the
q_{ij}^{(s)} values. Here are all the numerical values.

\[
\begin{aligned}
q_{11}^{(4)} &= 1 \quad \text{(job a is doing I/O)} \\
q_{12}^{(4)} &= 1 \quad \text{(job a is doing I/O)} \\
q_{23}^{(4)} &= (1 - 2/5)^{4} = 0.1296 \\
q_{24}^{(4)} &= (1 - 2/5)^{4} = 0.1296 \\
q_{25}^{(4)} &= (1 - 1/5)^{4} = 0.4096 \\
q_{35}^{(4)} &= (1 - 1/4)^{4} = 0.3164 \\
q_{36}^{(4)} &= (1 - 3/4)^{4} = 0.0039 \\
q_{47}^{(4)} &= (1 - 1/2)^{4} = 0.0625 \\
q_{48}^{(4)} &= (1 - 1/2)^{4} = 0.0625
\end{aligned}
\]

Using the same equation, we can get the following result.

\[
\begin{aligned}
B_w^{M} &= 0 + 0 + 0.8704 + 0.8704 + 0.8704 + 0.9961 + 0.9375 + 0.9375 \\
        &= 5.482
\end{aligned}
\]

Comparing these two results, we can see the distributed scheme produces 16.5%
more bandwidth than the mixed scheme.
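
The two bandwidth figures can be recomputed with a few lines of Python. This is a minimal sketch, not the simulator's code. It assumes the per-module form of the bandwidth equation used in the example (one minus the product, over the active jobs, of (1 - fraction)^s). The job labels b, c, and d, the function names, and the two fractions given for job a (3/5 and 2/5, inferred from its size of 1 2/3 modules) are assumptions introduced here. The other fractions are those implied by the values listed in the example.

```python
def module_bandwidth(ref_probs, s):
    """Probability that a module receives at least one reference per memory cycle.

    ref_probs: reference probabilities of the active jobs for this module
    s: memory-processor speed ratio (references a processor can issue per cycle)
    """
    no_reference = 1.0
    for prob in ref_probs:
        no_reference *= (1.0 - prob) ** s
    return 1.0 - no_reference


def total_bandwidth(layout, active_jobs, s):
    """Sum of module bandwidths; layout maps each module to {job: fraction}."""
    return sum(
        module_bandwidth([f for job, f in module.items() if job in active_jobs], s)
        for module in layout
    )


# Mixed scheme: one dictionary per module, eight modules in all.
MIXED = [
    {"a": 3 / 5}, {"a": 2 / 5},   # job a (inactive in the example)
    {"b": 2 / 5}, {"b": 2 / 5},   # job b, two full modules
    {"b": 1 / 5, "c": 1 / 4},     # shared "overflow" module
    {"c": 3 / 4},
    {"d": 1 / 2}, {"d": 1 / 2},
]

# Distributed scheme: every job is spread evenly over all eight modules.
DISTRIBUTED = [{job: 1 / 8 for job in "abcd"} for _ in range(8)]

if __name__ == "__main__":
    active = {"b", "c", "d"}      # job a is doing I/O
    b_dist = total_bandwidth(DISTRIBUTED, active, s=4)
    b_mixed = total_bandwidth(MIXED, active, s=4)
    print(round(b_dist, 3), round(b_mixed, 3), round(100 * (b_dist / b_mixed - 1), 1))
```

Keeping the layout as data makes it easy to try other sets of active jobs or other values of s without changing the two functions.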

In fact, if all four jobs are active, i.e., all four processors
are accessing the memory, the mixed scheme can produce a total bandwidth of
7.321, and the distributed scheme only produces 7.055. However, the proba-
bility that all jobs are active is rather small, especially if a job
spends a significant amount of time doing I/O, as in the workload we are
using. If more than one job is doing I/O, the distributed system
will open an even larger margin over the mixed system, since more modules
will idle in the mixed system. Therefore, the distributed system wins
by a large margin most of the time. This explains why the turnaround
time of the distributed system is the lowest among these schemes.

In  summary,  the  performance  difference  between  the  partitioned 
system  and  the  mixed  system  is  caused  by  the  job  packing,  and  that  between 
the  mixed  system  and  the  distributed  system  is  caused  by  the  job  bandwidth. 
This  has  been  carefully  explained  above  and  can  be  seen  in  Table  7. 

Recalling Figure 22, we can see all the curves converge
as m gets larger. For m=32, the monoprogramming curve of the partitioned
system is only 35% (28/80) higher than the multiprogramming curve of the
distributed system. This shows that the partitioned system is comparable
to the distributed system if we can afford a large number of modules.

Actually, one very important factor that makes the distributed
scheme better than any other scheme is the memory-processor speed ratio s.
The second term in the brackets of the equation for B_w will diminish
very fast as s gets larger, which makes the bandwidth approach m (the
perfect bandwidth) very quickly. On the other hand, the distributed
system will lose its superiority as s gets smaller. Table 9 shows the turn-
around time versus m value for s=2. This table is the same representation
as Figure 22, except we are emphasizing the numerical values this time.
As we can see, the monoprogramming result of the partitioned system pulls
within 14% when m=24.


                   Ta (Monoprogramming, Multiprogramming)

  System           m=8         m=16        m=24        m=32

  Partitioned      186, 191    91, 92      83, 83      80, 81
  Mixed             99,  95    83, 82      80, 79      78, 78
  Distributed       81,  82    73, 74      73, 73      71, 72

Table 9. Ta Versus m for Three Memory Allocation Schemes
( s=2 )

In the limiting case when s=1, the partitioned system will have
the best bandwidth if we assume the same number of jobs in the memory.
This is because the partitioned system does not have any memory interference
between jobs, but the other two do. The only reason the distributed or
mixed system can win is that they can fully utilize the memory, and hence
the probability that a job cannot enter due to insufficient memory is the
smallest. Our simulation result shows that the turnaround time of the
partitioned system is higher than the distributed system by only 15% (78/68)
when m=16, and by only 7% (72/67) when m=24. Therefore, the partitioned
system does perform very well when s is small.

Currently, memory technology can provide us with a semiconductor
memory with a cycle time of less than 100 ns. If we use a microprocessor
with a similar cycle time, then an s value of 1 is realizable. Hence, the
use of the partitioned scheme is indeed very favorable since it has so
many advantages as we described in Chapter 1, and yet it performs as well
as any other scheme.

We compared the performances of monoprogramming and multiprogramming
in the last section, and we claimed that monoprogramming is very
comparable with multiprogramming when we have enough processors. Both
Figure 22 and Table 9 again give strong support for this claim. As a
matter of fact, some results in Table 9 even show that monoprogramming is
slightly better than multiprogramming!

When we decrease the memory-processor speed ratio s, it could mean
that we use either a slower processor or a faster memory. In our
simulation, we hold the processor speed constant. So, changing s from
4 to 2 means reducing the memory cycle time by half. From Figure 22 and
Table 9, we can see that s has a significant effect on the partitioned
scheme, particularly for small m. When we reduce s from 4 to 2, the
turnaround time improves a lot, by anywhere from 800% to 35%. Of course,
we have to pay the price of faster memory. We will come back to this
subject later in this chapter.

Although there are several advantages of using the partitioned
scheme (for example, it is very reliable and easy to expand), there are
some implementation problems, e.g., address mapping. Let us give a simple
example to explain this problem.

Suppose  a  job  requires  three  modules  of  memory.  How  do  we  store 
this  job  in  the  memory  when  it  gets  these  modules?  We  can  either  three- 
way  interleave  this  job  or  store  it  in  a  sequential  manner,  i.e.,  store  the 
first  1/3  of  the  program  in  the  first  module,  the  second  1/3  in  the  second 
module,  and  the  rest  in  the  last  module.  This  second  method  does  not  create 
any  particular  addressing  problem,  since  the  instructions  and  data  are 
stored  sequentially  inside  a  module  and  the  ordinary  address  generation 
mechanism  can  be  used  to  produce  physical  addresses.  So,  as  long  as  we 
know  which  three  modules  contain  this  job,  we  will  have  no  problem  fetching 
or  storing  in  these  modules.  Of  course,  the  module  size  in  general  will 
be  a  power  of  2,  which  makes  the  address  mapping  extremely  easy.  However, 
this scheme may not allow us to take advantage of the independent
memory modules, since if we assume a serial address stream, only one word
will be accessed at a time. This implies we can only get a minimum
bandwidth of 1! For s > 1, this scheme apparently will waste processing
power due to insufficient memory bandwidth.

In order to get a higher bandwidth, we should use address
interleaving, which we assume in our simulator. If we use this scheme, we
need a more complicated mapping mechanism to generate the physical
addresses, since now the consecutive instructions or data are stored in
different modules. The difficulties are that the degree of interleaving
is variable depending on the job size, and that the modules allocated to
the same job might not be adjacent to each other. Hence, we cannot get
the next physical address by simply incrementing the current program
counter by one, as we can do in the above non-interleaving scheme or in
the distributed system. Of course, indirect addressing and branching
instructions are even more difficult to handle. Therefore, we do need
some extra hardware and an address generating algorithm in the
instruction unit if we want to use the interleaving scheme in a
partitioned system.

We are interested in a hardware solution to this problem. In
the next chapter, we will discuss a few feasible methods which involve the
use of a quotient-remainder operation.
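
The two storage layouts discussed above can be contrasted with a small
sketch. It is only an illustration of the quotient-remainder idea, not
one of the hardware methods of the next chapter; the function names, the
module numbers and the module size of 1024 words are all assumptions
made for the example.

```python
from typing import List, Tuple


def sequential_map(addr: int, modules: List[int], module_size: int) -> Tuple[int, int]:
    """Non-interleaved layout: fill the first module, then the second, etc.

    The quotient selects the module, the remainder is the word offset.
    """
    index, offset = divmod(addr, module_size)
    return modules[index], offset


def interleaved_map(addr: int, modules: List[int]) -> Tuple[int, int]:
    """k-way interleaved layout over the k modules owned by the job.

    The remainder selects the module, the quotient is the word offset.
    """
    offset, index = divmod(addr, len(modules))
    return modules[index], offset


# A job holding three modules that are not adjacent (hypothetical numbers).
job_modules = [2, 5, 6]
for a in range(4):
    print(a, sequential_map(a, job_modules, 1024), interleaved_map(a, job_modules))
```

With the interleaved layout, consecutive logical addresses fall in
different modules, so the divisor depends on the number of modules the
job holds; this is why a job that owns three modules needs a division by
three rather than a simple bit selection.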

From Figure 22 and Table 9, we can see that the mixed scheme only
outperforms the partitioned scheme by a small margin. However, it is
less reliable since the failure of a shared module might affect several
jobs. Moreover, it needs a more complicated operating system, which is
one thing we are against. So, we feel that the partitioned scheme is a
very good choice.

In  the  example  of  Figure  24,  we  assume  the  probability  that  a 
processor  accesses  a  certain  module  to  be  the  fraction  of  the  program  that 
is  stored  in  that  module.  This  assumption  obviously  is  only  valid  for  the 
random  access  system,  that  is,  a  processor  which  generates  references  to 
the  memory  modules  in  a  random  fashion.  In  other  words,  there  is  no  re- 
lationship between  any  two  successive  references  generated  by  a  processor 
and  the  second  reference  has  the  same  chance  to  refer  to  any  module  occupied 
by  this  program  independent  of  the  first  one. 

Of course, this assumption about random addressing is not
necessarily valid. It is well known that address streams tend to be
somewhat serial. A serial address stream will produce better performance
than a random address stream if the addresses are interleaved across
several modules, and worse performance if the addresses run vertically
in each module. Unfortunately, it is difficult to adequately quantify
this seriality or to determine a typical value of seriality. This has
forced us to use the random addressing assumption in our simulation.

Now, let us find out how reasonable our assumption is. Let us look
at some performance bounds to see how well and how badly our system will
perform if we assume the perfect and the worst memory bandwidth cases.
In the perfect memory bandwidth case, we will assume there is no memory
conflict between processors, so each processor will be getting a maximum
possible bandwidth of s. However, in a distributed system, if the number
of active processors (n_r) times s is greater than the total number of
memory modules (m), we will assume each processor is only getting a
bandwidth of m/n_r (< s). In a mixed or partitioned system, if s is
greater than the number of modules assigned to a job, m_j, then the
processor will only get a bandwidth of m_j. Of course, we will then get
the best possible performance among all the systems that have the same
set of system parameters, i.e., a lower bound on the turnaround time.
This case can only happen if we horizontally interleave every program
across the memory modules, and assume a perfect condition of no memory
conflict.

On the other hand, for the worst memory bandwidth, we will assume
each active processor is only getting a bandwidth of 1, which corresponds
to the situation when we vertically store each program inside a memory
module (no interleaving at all) and unluckily no two references will ever
go to two different modules. Of course, this case might not happen but it
does give us the worst performance, which can serve as an upper bound on
the turnaround time.
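
The per-processor bandwidths assumed in the two bounding cases can be
written down directly. The following is a minimal sketch under the
assumptions stated in the text; the function and parameter names
(n_active for the number of active processors, m_job for the number of
modules assigned to a job) are ours and do not come from the simulator.

```python
def perfect_bw_distributed(s: float, n_active: int, m: int) -> float:
    """Perfect case, distributed system: no conflicts, so each processor
    gets s unless the n_active processors together would need more than
    the m modules can supply, in which case they share m equally."""
    if n_active * s > m:
        return m / n_active
    return s


def perfect_bw_partitioned(s: float, m_job: int) -> float:
    """Perfect case, mixed or partitioned system: a job can never use
    more than the m_job modules assigned to it."""
    return min(s, m_job)


# Worst case for every scheme: each active processor gets a bandwidth of 1.
WORST_BW = 1.0

print(perfect_bw_distributed(s=4, n_active=8, m=16))
print(perfect_bw_distributed(s=4, n_active=8, m=32))
print(perfect_bw_partitioned(s=4, m_job=3))
```

For example, with eight active processors and s=4, a distributed system
needs at least 32 modules before every processor can receive its full
bandwidth in the perfect case.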

Figure 25 repeats the curves of Figure 22 together with four
curves which represent the performance of the perfect and worst memory
bandwidth cases. We use monoprogrammed systems to derive the two upper
bound curves since they contain fewer jobs. On the other hand, we use
multiprogrammed systems to obtain the two lower bound curves since they
can contain more jobs. As we can see, the monoprogrammed, partitioned
system yields the worst result, which we call the largest upper bound,
and the multiprogrammed, distributed system yields the best result, which
we call the smallest lower bound. Any performance curve will be bounded
between these two curves no matter which memory allocation scheme we use
and how we assume the reference probabilities, as long as we are using
eight processors, the shortest-job-first algorithm, an average I/O speed
of 42 ms, 1024K bytes of main memory, a full connection, and a speed
ratio of 4 (cf. Figure 22).

One interesting thing is that when m >= 16, all the curves are
clustered above the lower bound curve. Obviously, the random distribution
assumption already gives us pretty good results. Further complication of
the memory bandwidth calculation apparently can only have a very minor
effect on these performance curves. This explains why we are using the
random distribution assumption in all our simulations. Furthermore, as we
can see, all the curves are far below the upper bound curves. This tells
us that the (horizontal) interleaving scheme can be a very important
factor in the system performance.

Now, let us briefly summarize the results we have in this section.
Table 10 shows an overall comparison of these three memory allocation schemes.

Parameter                   Partitioned         Mixed      Distributed
Total Memory Bandwidth      Moderate            Moderate   High
Job Memory Bandwidth        Moderate            Moderate   High
(Word) Memory Utilization   Bad                 Good       Good
Reliability                 Good                Moderate   Bad
Turnaround Time             Bad for small m,    Good       Best
                            good for large m
Memory Waste                Yes                 No         No

Table 10. Summary of the Performance of Three Memory
Allocation Schemes.

The distributed scheme leads in all items except reliability. On the
other hand, the partitioned scheme trails in all items, except that it is
the most reliable one. However, we have shown that when we increase the
number of memory modules in the system or improve the memory speed, the
performance of the partitioned scheme improves very quickly. For a
moderately large number of modules, say 24, the partitioned system already
has a performance which is very comparable to the distributed system.
Besides, the partitioned scheme provides a very high reliability. In a
system where reliability is extremely important, the partitioned scheme
should be considered first. If we need a higher performance and can
sacrifice a little bit of reliability, then the mixed scheme might be a
better choice. Of course, if we are primarily interested in performance
and have very reliable hardware, i.e., the mean time between faults (MTBF)
is long, then the distributed scheme will be the best candidate.

3.1.4  Job  Scheduling  Algorithm 

One of the major factors that affect the turnaround time of a job
is the scheduling algorithm. The scheduling algorithm is used to determine
the order in which the jobs in the waiting queue will enter the system.
This is based on some attribute of these jobs, not necessarily the order
in which they arrive. Therefore, some new jobs might enter the memory and
get executed before a job which arrived previously. This, of course,
increases the time this old job has to spend in the waiting queue. But on
the other hand, those jobs which get into the memory earlier will suffer
shorter queueing delays.

In fact, the purpose of using a scheduling algorithm is to somehow
scramble the execution order of a certain set of jobs such that some
kind of performance will be improved. Most of the time, the average
turnaround time or the average queueing time will be what people are
trying to improve.

The queueing time of scheduling algorithms has been extensively
studied by queueing theorists. In [26], Kleinrock has a very complete
discussion on this subject. Most of the analytic results are expressed in
terms of the average conditional queueing time, i.e., the queueing time of
a job which needs a certain amount of processing time. For example,
Figure 26 shows the average conditional queueing time curves of three
commonly used scheduling algorithms in time-sharing systems, namely FCFS
(first-come-first-serve), RR (round-robin), and FB (foreground-background),
assuming the system is an M/M/1 queue. We do not show the scales since
they depend on the arrival rate, the mean service time and the
service-time distribution. Only the shape of each curve is shown, which
gives the effect of a scheduling algorithm on jobs with different
processing time requirements.

As we can see, the average conditional queueing time for FCFS is
the same (constant) for any job, whether it requires long processing time
or short processing time. This type of scheduling algorithm is called
non-discriminating. FCFS can give the shortest queueing time to long jobs.
However, it gives the longest queueing time to short jobs. The average
conditional queueing time for RR, on the other hand, grows linearly as the
processing time increases. The longer the job is, the larger the queueing
time will be. This kind of scheduling algorithm is called
linear-discriminating. It discriminates against long jobs. Similarly, FB
is called the most-discriminating since it gives the longest queueing time
to long jobs among all known scheduling algorithms. But, FB yields the
shortest queueing time to short jobs. RR is a very popular scheme in a lot
of time-sharing systems. FB is used in the famous MULTICS system.

Figure 26. Average Conditional Queueing Time for M/M/1 System.
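
The constant and linear shapes described above can be reproduced from the
standard M/M/1 results. The sketch below is an illustration under stated
assumptions and is not the source of the curves in the figure: round-robin
is taken in its processor-sharing limit (quantum approaching zero), FB is
omitted because its expression is more involved, and the function names
and parameter values are ours.

```python
def fcfs_wait(lam: float, mu: float) -> float:
    """Mean queueing time under FCFS in an M/M/1 queue.

    It does not depend on the service requirement of the job.
    """
    rho = lam / mu
    if not 0 <= rho < 1:
        raise ValueError("requires 0 <= lam < mu")
    return rho / (mu * (1.0 - rho))


def rr_wait(x: float, lam: float, mu: float) -> float:
    """Mean queueing time of a job needing x units of service under
    round-robin, in the processor-sharing limit: linear in x."""
    rho = lam / mu
    if not 0 <= rho < 1:
        raise ValueError("requires 0 <= lam < mu")
    return rho * x / (1.0 - rho)


lam, mu = 0.5, 1.0  # hypothetical arrival and service rates
for x in (0.25, 1.0, 4.0):
    print(x, fcfs_wait(lam, mu), rr_wait(x, lam, mu))
```

Under these assumptions the two curves cross at x = 1/mu, the mean
service time: jobs shorter than average wait less under RR, and jobs
longer than average wait less under FCFS.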

There is no absolute standard for which algorithm is the best among
these three. It all depends on what kind of measurement we are most
interested in and the job mix we are dealing with. For example, if we are
interested in the overall average queueing time and most of the jobs are
short jobs, then apparently we should go for FB. However, since we are
dealing with batch systems, we will not use these algorithms.

In  our  study,  we  will  use  the  average  turnaround  time  (Ta)  of  all 
jobs  as  our  measurement  of  goodness.  This  is  the  same  as  using  the  overall 
average  queueing  time  (q),  since  a  longer  q  always  implies  a  longer  Ta.  We 
will  show  both  of  the  measures  later. 

The following eight scheduling algorithms will be studied here.

1. FCFS: first-come-first-serve
2. SJF: shortest-job-first
3. LJF: longest-job-first
4. SMF: smallest-job-first
5. LMF: largest-memory-first
6. SMNF: smallest-magic-number-first
7. SPTF: shortest-processing-time-first
8. BMFF: best-memory-fit-first

All these names are self-explanatory. SMNF is a scheme used in our IBM
360/75 system. Each job is assigned a magic number which is calculated by
using the following formula:

MN = 3*(processing time) + 0.01*(job size) + 0.05*(number of I/O requests)

Then, the job with the smallest magic number will get executed first. This
scheme not only penalizes long jobs but also penalizes large jobs (jobs
requiring large memory spaces), since the above formula takes both
processing time and job size into account. BMFF will choose the job which
can fit into the memory best. If the available memory space is very large,
then BMFF will act just like LMF since the largest job in the waiting
queue will be chosen. However, if the remaining space cannot hold the
largest job, some smaller job which fits best will be chosen instead.
SPTF will choose the job with the shortest processing time. It is slightly
different from SJF, since SJF also takes I/O time into account. In other
words, SJF will choose the job which has the smallest CPU plus I/O time.

When  a  job  arrives,  it  will  be  placed  somewhere  in  the  waiting 
queue  according  to  the  scheduling  algorithm.  For  example,  SMNF  will  line 
up  the  jobs  such  that  their  magic  numbers  are  in  increasing  order.  Then, 
the  queue  will  be  considered  from  the  beginning  every   time. 
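
The queue ordering described above can be sketched as a sort with one key
per algorithm. This is a minimal illustration, not the simulator's code:
the Job fields, their units and the sample values are assumptions, and
only those algorithms whose ordering the text states explicitly are
included (BMFF is left out because its choice depends on the memory free
at selection time rather than on a fixed order).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    name: str
    cpu_time: float     # processing time (units assumed)
    io_time: float      # total I/O time (units assumed)
    size: float         # memory requirement (units assumed)
    io_requests: int    # number of I/O requests


def magic_number(job: Job) -> float:
    """Magic number used by SMNF, following the formula in the text."""
    return 3 * job.cpu_time + 0.01 * job.size + 0.05 * job.io_requests


# Smaller key means closer to the head of the waiting queue.
ORDERINGS: Dict[str, Callable[[Job], float]] = {
    "FCFS": lambda j: 0.0,                    # stable sort keeps arrival order
    "SJF": lambda j: j.cpu_time + j.io_time,  # CPU plus I/O time
    "SPTF": lambda j: j.cpu_time,             # processing time only
    "LMF": lambda j: -j.size,                 # largest memory first
    "SMNF": magic_number,
}


def order_queue(jobs: List[Job], algorithm: str) -> List[Job]:
    """Return the waiting queue (given in arrival order) as it would be
    lined up under the chosen algorithm."""
    return sorted(jobs, key=ORDERINGS[algorithm])


queue = [
    Job("a", cpu_time=10, io_time=1, size=100, io_requests=2),
    Job("b", cpu_time=2, io_time=20, size=500, io_requests=10),
    Job("c", cpu_time=5, io_time=5, size=50, io_requests=0),
]
for name in ORDERINGS:
    print(name, [j.name for j in order_queue(queue, name)])
```

Sorting the whole queue gives the same order as inserting each arriving
job at its proper position, which is how the text describes the queue
being maintained.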

Table 11-a shows Ta versus p values for all these algorithms, and
Table 11-b displays the numerical values of the average queueing time (q).
As we can see, SJF gives the smallest turnaround time among all these
scheduling algorithms. This is what we would expect since it has been
proven analytically [26]. Therefore, we use SJF in all other discussions.

In fact, for p > 6 all these algorithms perform more or less the
same. For example, FCFS is within 10% of SJF. This means that when we have
enough hardware the scheduling algorithm really does not make much
difference to the performance. This is very easy to understand, since
when the throughput is high the system will be lightly loaded, and no
matter how we schedule the jobs each job will only suffer very little
delay. Only when the system is heavily loaded will the scheduling
algorithm be important. Maybe the following adaptive method can be used.
If the system is lightly loaded, all the jobs will be served according to
their arriving order and no scheduling algorithm will be used. When the
system load exceeds a certain threshold, a scheduling algorithm, e.g.,
SJF, will become effective to schedule the jobs waiting in the queue.

Table 11. Comparison of Eight Different Scheduling Algorithms
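
The adaptive method suggested above can be sketched in a few lines. The
text does not say how the load should be measured, so this sketch assumes
the length of the waiting queue as the load measure; the threshold value
and the (arrival time, service time) job representation are likewise
hypothetical.

```python
from typing import List, Tuple

JobEntry = Tuple[float, float]  # (arrival time, CPU plus I/O time)


def adaptive_order(queue: List[JobEntry], threshold: int) -> List[JobEntry]:
    """Serve in arrival order while the queue is short; switch to SJF
    once the queue length (our assumed load measure) exceeds threshold."""
    if len(queue) <= threshold:
        return sorted(queue, key=lambda job: job[0])   # arrival order
    return sorted(queue, key=lambda job: job[1])       # shortest job first


waiting = [(0.0, 9.0), (1.0, 2.0), (2.0, 5.0)]
print(adaptive_order(waiting, threshold=5))  # lightly loaded
print(adaptive_order(waiting, threshold=2))  # heavily loaded
```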

In  Figure  20,  we  have  shown  the  sensitivity  of  the  turnaround  time 
when  we  increase  the  number  of  processors.  Table  11  also  shows  the  same 
phenomenon.  Moreover,  the  drop  is  even  sharper  here.  This  again  shows  the 
importance  of  having  enough  processors  in  the  system. 

As  we  said  at  the  beginning  of  this  section,  the  scheduling 
algorithm  is  used  to  determine  the  order  that  the  waiting  jobs  will  be 
considered  for  entering  the  memory.  In  general,  all  the  jobs  will  be  lined 
up  in  the  queue  according  to  the  scheduling  algorithm.  Hence,  the  job 
at  the  head  of  the  queue  will  always  be  considered  first.  Of  course,  this 
job  might  not  be  able  to  get  into  the  memory  when  we  are  looking  for  a  job 
to  be  executed.  A  new  problem  arises  here,  that  is,  shall  we  skip  this  job 
and  consider  the  next  one?  This  is  the  so-called  "look-ahead"  problem. 

Apparently,  there  is  no  reason  we  should  not  consider  the  second 
job  if  the  first  job  gets  blocked  due  to  lack  of  memory,  since  we  can 
shorten  the  queueing  time  of  the  second  job  if  it  can  fit  into  the  memory. 
So,  it  is  conceivable  we  might  improve  the  average  turnaround  by  doing 
look-ahead. 

Naturally, the next question is, if we allow looking ahead, do we
consider the third job if the second one still cannot enter the memory? In
other words, do we allow look-ahead to be carried on to the third job? This
is usually called the "look-ahead distance" problem. The look-ahead distance
is defined to be the maximum number of jobs we can look at down the queue.
For example, if the look-ahead distance is 2, then we can at most look at
the first two jobs and cannot look beyond the second job. Therefore, a
look-ahead distance of one is equivalent to no look-ahead.

It  is  true  that  the  chance  of  finding  a  "fittable"  job  is  better 
if  we  allow  longer  look-ahead  distance.  But,  it  does  not  necessarily  mean 
we  can  get  a  better  average  turnaround  time  by  increasing  the  look-ahead 
distance,  since  the  original  order  set  up  by  the  scheduling  algorithm  will 
be  perturbed  by  the  look-ahead  scheme,  and  the  longer  the  distance  is  the 
more  this  order  will  be  perturbed.  In  other  words,  the  look-ahead  distance 
will  cancel  the  effect  of  the  scheduling  algorithm.  So,  a  large  look-ahead 
distance  might  not  be  desirable. 

Figure  27  shows  the  effect  of  look-ahead.  We  can  see  when  we  allow 
a  moderate  look-ahead  distance,  say  4,  we  do  gain  some  benefit.  However, 
a  larger  look-ahead  distance  might  even  cause  a  negative  effect!  Therefore, 
we  suggest  a  look-ahead  scheme  with  a  moderate  distance. 
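
The look-ahead rule defined above amounts to scanning a bounded prefix of
the waiting queue. The following sketch assumes that each queued job is
represented only by its memory requirement and that a job "fits" when
this requirement does not exceed the free memory; the function name and
the sample sizes are ours.

```python
from typing import List, Optional


def pick_job(queue_sizes: List[int], free_memory: int, distance: int) -> Optional[int]:
    """Return the queue position of the first job, among the first
    `distance` jobs, that fits into the free memory, or None if none of
    them fits. A distance of 1 is equivalent to no look-ahead."""
    if distance < 1:
        raise ValueError("look-ahead distance must be at least 1")
    for position, size in enumerate(queue_sizes[:distance]):
        if size <= free_memory:
            return position
    return None


sizes = [300, 200, 50]          # memory needs, head of the queue first
print(pick_job(sizes, free_memory=100, distance=1))
print(pick_job(sizes, free_memory=100, distance=3))
```

A job chosen from position 2 runs before the two jobs the scheduling
algorithm placed ahead of it, which is the perturbation of the original
order that the text warns about for large distances.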

3.1.5  Effects  of  Job  Characteristics 

In this section, we are going to study the effects of job
characteristics. As we mentioned earlier, each job is characterized by
four parameters, namely, the arrival time, the CPU time, the job size, and
the number of I/O requests. Needless to say, every one of them will affect
the system performance. Our purpose here is to find out how sensitive the
system is to each parameter.

Let us first look at the arrival time. The arrival time determines
how fast the job stream puts work on the system. Of course, the faster the
jobs arrive, the heavier the workload will be. Since the processing power
and the memory space are limited, more jobs will be accumulated in the
waiting queue if the jobs come in faster, and thus each job will suffer
a longer delay. Apparently, the average turnaround time should go up as we
increase the arrival rate.

Figure 27. The Effect of Look-Ahead Scheme

In  order  to  change  the  arrival  rate,  we  will  multiply  the  arrival 
time  of  each  job  by  a  variable  called  the  "Arrival  Scaling  Factor."  We  can 
speed  up  the  arrival  rate  by  using  a  smaller  arrival  scaling  factor,  since 
the  arrival  time  of  each  job  will  then  be  scaled  down  to  a  smaller  value. 
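
The effect of the arrival scaling factor on the arrival rate can be
checked with a few lines. This is a sketch with made-up arrival times;
the helper names are ours and the rate is estimated simply as the number
of inter-arrival gaps divided by the time span.

```python
from typing import List


def scale_arrivals(arrival_times: List[float], factor: float) -> List[float]:
    """Multiply every arrival time by the arrival scaling factor."""
    return [t * factor for t in arrival_times]


def mean_arrival_rate(arrival_times: List[float]) -> float:
    """Average number of arrivals per unit time over the observed span."""
    span = arrival_times[-1] - arrival_times[0]
    return (len(arrival_times) - 1) / span


arrivals = [0.0, 10.0, 20.0, 30.0]          # hypothetical arrival times
faster = scale_arrivals(arrivals, 0.1)
print(mean_arrival_rate(arrivals), mean_arrival_rate(faster))
```

A scaling factor of f multiplies the arrival rate by 1/f, so a factor of
0.1 makes the jobs arrive ten times as fast.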

Figure 28 shows how the average turnaround time responds when we
change the arrival scaling factor. As we can see, when we decrease the
scaling factor from 1.0 down to 0.3, the average turnaround time does not
change very much. Apparently, the system is unsaturated, or lightly
loaded, within this range. In other words, the jobs do not arrive as fast
as the processors can process them. This does not mean our IBM 360/75
system is underloaded, even though the job characteristics are obtained
from analyzing the real workload on this machine. It is because we are
using more processors in our model, and hence our system has a higher
processing power.

However, when we further decrease the arrival scaling factor, Ta
starts going up. Especially below 0.1, Ta increases very sharply.
Obviously, the system starts getting saturated at around 0.1. Heavier
loading will push the system into oversaturation, and the average
turnaround time starts to blow up.

When the jobs arrive faster, it is conceivable the system will
become busier. This is because the possible idle periods will become
smaller. Figure 29 shows how the percentage of the time the system is busy
increases as we decrease the arrival scaling factor. By busy, we mean at
least one job is in the memory, no matter whether it is doing I/O or being
processed by a processor. At 0.08, the system is busy for more than 95% of
the time. This confirms that the system is indeed getting saturated when
we decrease the scaling factor below 0.1.

When  the  system  is  in  saturation,  measures  of  throughput  or 
turnaround  time  might  not  reflect  the  true  effects  of  some  system  parameters, 
and  similarly  when  the  system  is  far  below  saturation.  Therefore, 
we  scale  the  arrival  time  by  a  factor  of  0.1  in  all  our  simulations.  This 
places  the  system  in  an  interesting  region. 

Now, let us look at the effect of the job size distribution. As
we said earlier, one major reason that a job gets blocked from being
executed is the lack of memory. If we fix the memory size, apparently the
job size will have a big impact on the system performance. Of course, the
larger the job sizes are, the smaller the average number of jobs the
memory can contain, and the more frequently a job will get blocked.
Therefore, we should expect to get a longer average turnaround time if we
use a job mix which has a larger job size distribution.

Figure 30 shows the effect of the job size on the average job
turnaround time. We fix all other parameters and change the job size by
multiplying the size of each job by a Job Size Scaling Factor. This is
similar to what we have done to the job arrival time. Thus, if the scaling
factor is 2, the size of each job will be doubled. We can see the
turnaround time is very sensitive to the change of the job size. When we
double the scaling factor from 1.0 to 2.0, the average turnaround time
increases by about 150% (from 83 to 209). Again, Ta almost doubles when we
increase the scaling factor from 2.0 to 2.5. Apparently, Ta grows
exponentially when we increase the job size. This indicates the importance
of having enough memory.

Table  12  shows  the  corresponding  memory  utilization.  Just  as 
we  have  expected,  the  memory  utilization  increases  when  the  job  size  goes 
up.  This  again  tells  us  the  memory  utilization  should  not  be  used  as  an 
indication  of  the  performance. 

The limited size of memory is usually the bottleneck in a computer
system. Figure 30 shows that this is especially true in a multiprocessor
system. However, this does not say the use of an extremely large memory
will always do some good. When we reduce the job size scaling factor
down to 0.5, which is equivalent to using a memory twice as large,
we still get the same performance, but the memory utilization has been
decreased to only 42%. This means we are wasting the memory and getting
no improvement at all. So, an appropriate amount of memory should be used
in order to get both good performance and good utilization. From Table 12,
we can see 1024K bytes is the best memory size for the workload we are
using. This is why we use this memory size throughout this chapter.

Finally,  let  us  look  at  the  effect  of  increasing  CPU  time  or  the 
number  of  I/O  requests.  Both  of  them  will  increase  the  time  a  job  has  to 
spend  in  the  memory.  This  in  turn  will  increase  the  time  those  waiting 
jobs  have  to  spend  in  the  queue.  Therefore,  the  increase  of  either  CPU 
time  or  the  number  of  I/O  requests  of  a  job  has  a  twofold  effect  on  the 
average  turnaround  time.  We  briefly  mentioned  this  in  Section  3.1.2  when 
we  compared  monoprogramming  and  multiprogramming. 

We will discuss the effect of increasing CPU time first. Again,
we use a CPU Time Scaling Factor to scale the processing time of each job.
Figure 31 shows how the average turnaround time responds when we increase
(or decrease) the CPU time. The curve first grows slightly more than
linearly, and then starts taking off once we double the CPU time scaling
factor. This is not surprising since the system is getting saturated when
we scale the CPU time beyond 2.0.

Job Size Scaling Factor    Ta     Um
0.5                        84     0.42
1.0                        83     0.58
1.5                        96     0.76
2.0                        209    0.87
2.5                        402    0.88

Table 12. The Average Turnaround Times and Memory
Utilizations for Different Job Sizes.

The use of a scaling factor effectively "stretches out" the
distribution curve since every job is enlarged by the same factor.
Theoretically, if we use a random number generator with the same
distribution and the scaled mean to generate a job stream, these jobs
should have a distribution roughly the same as the stretched distribution.
In other words, the two methods should produce the same characteristics.
The dotted curve in Figure 31 displays the result obtained by using the
random number generator method. We can see there is not much difference
between these two methods. Therefore, we use the scaling factor method
since it is easier to apply.

The curves of Figure 31 are very similar to those of Figure 30.
In fact, they almost coincide with each other. Apparently, both CPU time
and job size have the same effect on the system performance. One other
interesting point is that all these curves are very similar to the
turnaround time curve of an M/M/1 queueing system. The latter can be
expressed by the following equation:

$$T = \frac{1/\mu}{1-\rho} = \frac{1}{\mu-\lambda}$$

where $\mu$ is the service rate, $\lambda$ is the arrival rate, and
$\rho = \lambda/\mu$ is the system utilization. Increasing the CPU time or
the size of a job is in fact equivalent to reducing the service rate. T
increases when we decrease $\mu$. When $\mu$ gets very close to $\lambda$,
T will become extremely large since the system is about to saturate. This
can explain why the Ta curve goes up very sharply when we scale the CPU
time beyond a certain limit.
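
The saturation behavior of the M/M/1 formula can be illustrated with a
short sketch. This is not part of the simulation study; the arrival rate
and base service rate below are assumed values chosen only to show the
shape of the curve, and the function name is hypothetical.

```python
def mm1_turnaround(service_rate: float, arrival_rate: float) -> float:
    """Mean time in an M/M/1 system: T = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("saturated: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


# Assumed illustrative values, not taken from the simulation.
ARRIVAL_RATE = 0.4        # jobs per unit time
BASE_SERVICE_RATE = 1.0   # jobs per unit time at scaling factor 1.0

if __name__ == "__main__":
    for scale in (1.0, 1.5, 2.0, 2.4):
        # Stretching every job by `scale` divides the service rate by `scale`.
        mu = BASE_SERVICE_RATE / scale
        rho = ARRIVAL_RATE / mu
        t = mm1_turnaround(mu, ARRIVAL_RATE)
        print(f"scale={scale:.1f}  rho={rho:.2f}  T={t:.2f}")
```

With these assumed rates, T grows slowly at first and then rises steeply
as the utilization approaches 1, which is the same qualitative shape as
the Ta curves.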

Table 13 shows the average turnaround, queueing and service times
of a job for different CPU time scaling factors. As we can see, most of the
increment comes from the average queueing time. This is what we should
expect: each job remains in the memory for a longer time while the
arrival rate remains the same, so more jobs accumulate in the waiting
queue and a new arrival has to spend much more time in the queue.
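
As a rough check of this observation, the sketch below uses the values
listed in Table 13 to confirm that Ta = q + x in every row and to compute
how much of the growth in Ta comes from queueing. The helper name is
hypothetical.

```python
# (CPU time scaling factor, Ta, q, x) as listed in Table 13.
TABLE_13 = [
    (0.8, 74, 11, 63),
    (1.0, 83, 12, 71),
    (1.2, 94, 19, 75),
    (1.4, 108, 29, 79),
    (1.6, 128, 41, 87),
    (1.8, 155, 62, 93),
    (2.0, 230, 131, 99),
]


def queueing_share(rows, low, high):
    """Fraction of the Ta increase between two factors that is due to q."""
    by_factor = {factor: (ta, q) for factor, ta, q, _x in rows}
    ta_low, q_low = by_factor[low]
    ta_high, q_high = by_factor[high]
    return (q_high - q_low) / (ta_high - ta_low)


if __name__ == "__main__":
    # Turnaround time is queueing time plus service time in every row.
    print(all(ta == q + x for _f, ta, q, x in TABLE_13))
    print(round(queueing_share(TABLE_13, 1.0, 2.0), 2))
```

Between scaling factors 1.0 and 2.0, Ta rises by 147 while q rises by 119,
so roughly four fifths of the increase is queueing time.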

When  we  design  a  system,  we  should  be  very   careful  about  the  job 
arrival  rate,  the  average  CPU  time,  and  the  processing  power  we  have.  If 
the  average  turnaround  time  falls  on  the  high  rising  edge  of  the  curve,  we 
should  try  to  lower  it  by  adding  more  hardware. 

3.2  Results  for  Hardware  Related  Questions 

In  the  second  half  of  this  chapter,  we  are   going  to  answer  the 
hardware  questions  we  listed  in  Chapter  1.  Mainly,  we  will  investigate  the 
effects  of  using  different  amounts  and  different  speeds  of  hardware,  then 
find  a  cost-effective  way  of  building  a  machine. 

We will also look into a very interesting problem, namely, the
processor-memory interconnection problem. The results we have shown so
far assume a full processor-memory connection scheme, i.e., each processor
is physically connected to every memory module and can access any module.
For example, a crossbar switch is a full connection scheme. Under this
scheme, each module can be assigned to any processor, so a job is never
prevented from entering the system by an inability to connect available
memory to an available processor.

          CPU Time
          Scaling Factor      Ta       q        x

              0.8             74       11       63
              1.0             83       12       71
              1.2             94       19       75
              1.4            108       29       79
              1.6            128       41       87
              1.8            155       62       93
              2.0            230      131       99

Table 13. The Effect of Increasing CPU Time on the
Average Queueing Time (q) and Average
Service Time (x).

However, as we mentioned in Chapter 1, a full connection is very
expensive. Its cost can go up very quickly if we want to expand the system.
For example, if we want to double a 4x8 system to an 8x16 system, the cost
of the connection network will go up four times. Besides, a full
connection network usually is not easy to expand. For example, in a
crossbar switch, it is very difficult to increase the size of the fan-out
tree, since that requires a complete rewiring between the fan-out tree and
the fan-in tree. Or, in a multiport memory system, we will have to replace
all the memory interfaces if the number of processors of the expanded
system exceeds the number of ports in a single module.

Therefore,  we  are  interested  in  using  a  partial  connection  network. 
Let  us  recall  the  connection  network  of  the  PRIME  system  shown  in  Figure  1. 
Each  processor  is  connected  to  8  memory  modules  via  a  private  bus.  Each 
memory  module  has  four  ports,  so  up  to  four  processors  can  connect  to  one 
module.  In  the  current  PRIME  system,  there  are  five  processors  and  13  memory 
modules,  so  40  out  of  the  52  ports  are   being  used.  This  is  a  typical  partial 
connection  system.  In  our  study  of  partial  connection,  we  will  assume  this 
kind  of  architecture. 

Of course, we would expect some degree of performance degradation,
since each processor only connects to a subset of the modules and thus
can only access part of the memory. Hence, a processor cannot be assigned
to a job if the memory attached to this processor is not big enough, even
though the total available space in the memory is big enough. Consequently,
the probability that a job will be blocked is larger in a partial connection
system. However, the cost of a partial connection does not grow as fast
as that of a full connection. In fact, it will grow linearly if we use
multiport memories.
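
The two growth rates can be compared with a small sketch. It assumes, for
illustration only, that the cost of a crossbar is proportional to its
number of crosspoints and that the cost of a multiport partial connection
is proportional to the number of processor-to-module links; both function
names are hypothetical.

```python
def crossbar_crosspoints(processors: int, modules: int) -> int:
    """Full connection: one crosspoint per (processor, module) pair."""
    return processors * modules


def multiport_links(processors: int, modules_per_processor: int) -> int:
    """Partial connection: each processor is wired to a fixed number of modules."""
    return processors * modules_per_processor


if __name__ == "__main__":
    # Doubling a 4x8 crossbar to 8x16 quadruples the crosspoints.
    print(crossbar_crosspoints(4, 8), crossbar_crosspoints(8, 16))
    # Doubling the processors at 8 modules each only doubles the links.
    print(multiport_links(5, 8), multiport_links(10, 8))
```

Under these assumptions the full connection grows with the product of the
two dimensions, while the partial connection grows with the number of
processors alone.
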

Most of all, the partial connection does allow us to expand the
system without too much trouble. If we are increasing the memory, we can
just connect the additional module to some processor arbitrarily or following
some rule. If we are also increasing the number of processors, we might
need to move some connectors to reconfigure the whole system, but no
hardware modification is needed.

We will study the performance degradation of a partial connection
network. In addition, we will look at an interesting problem, namely, how
we should interconnect the processors and the memories so that we get
minimal degradation. In general, a processor will be allowed to connect
to only part of the memory in a partial connection system. How many
modules are assigned to a particular processor will greatly influence
the job handling ability of that processor. Obviously, the more memory a
processor connects to, the larger the job it can handle.

In the PRIME system, all processors are connected to the same
number of memory modules, namely, 8. Hence, all of them are equally
"capable." Of course, this is based on the assumption that no job will ever
require more than eight modules, and the fact that there are enough ports
in the memory. Obviously, if either of these two conditions is violated,
this connection will not work any more, and we need a new configuration.
For example, if some jobs need 12 memory modules, then at least one
processor should be connected to that many modules. However, not all of
them can have 12 memory modules, since for 5 processors that would require
a total of 60 memory ports. Therefore, an "uneven" connection might be more
effective, that is, some processors will have more memory and some will
have less. How to distribute the total available ports to the processors
in this case is a very interesting distribution problem. This is part of
the interconnection problem we will be looking at later.
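
The port arithmetic in this argument can be written down as a small
feasibility check. This is only a sketch: `fits` is a hypothetical helper
that tests a necessary condition (enough ports in total, and no processor
wired to more modules than exist), and the uneven allocation in the last
line is an invented example, not a configuration studied here.

```python
def fits(allocation, modules: int, ports_per_module: int) -> bool:
    """Necessary (not sufficient) check for a port allocation.

    allocation[i] is the number of modules wired to processor i.
    """
    if any(k > modules for k in allocation):
        return False
    return sum(allocation) <= modules * ports_per_module


if __name__ == "__main__":
    MODULES, PORTS = 13, 4                         # PRIME: 52 ports in total
    print(fits([8] * 5, MODULES, PORTS))           # uses 40 of the 52 ports
    print(fits([12] * 5, MODULES, PORTS))          # would need 60 ports
    print(fits([12, 10, 10, 10, 10], MODULES, PORTS))  # hypothetical uneven case
```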

After we determine the number of memory modules each processor
will get, another problem arises immediately, namely, which modules we
should choose to connect to a certain processor. In the PRIME system, this
is done in a rather uniform way. Each processor connects to 8 consecutive
modules, with the leading module two modules to the right of the leading
module of the previous processor. Of course, this might not be a good way
to arrange the connection. However, it is very difficult to come up with
an appropriate analytic argument to show what a good connection might be.
What we will try to do is to simulate several combinations and compare their
results. Then, by analyzing the connections which yield better results, we
can get some idea of what we might need to do in order to achieve a good
connection. This is the second part of the interconnection problem we will
be studying.
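
The uniform PRIME-style wiring described above can be sketched as
follows. The sketch assumes that module numbering wraps around after the
last module, which the description does not state explicitly, and the
function name is hypothetical.

```python
from collections import Counter


def staggered_connection(processors: int, modules: int, width: int, stride: int):
    """Wire each processor to `width` consecutive modules.

    Processor p starts `stride` modules to the right of processor p - 1.
    Wrap-around (modulo `modules`) is an assumption of this sketch.
    """
    return [
        [(p * stride + j) % modules for j in range(width)]
        for p in range(processors)
    ]


if __name__ == "__main__":
    wiring = staggered_connection(processors=5, modules=13, width=8, stride=2)
    ports_used = Counter(m for row in wiring for m in row)
    for p, row in enumerate(wiring):
        print(p, row)
    # Total ports in use and the busiest module's port count.
    print(sum(ports_used.values()), max(ports_used.values()))
```

Under the wrap-around assumption this wiring uses 40 ports in total and no
module needs more than four, which agrees with the four-port modules and
the 40-of-52 figure quoted for PRIME.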

3.2.1  Hardware  Quantity  Effect 

Let us now look at how the system responds when we increase the
hardware. Of course, the more hardware we add into the system, the better
performance we should get. What we want to know here is how sensitive the
system performance is to each type of hardware resource. We will then
know what to buy in order to achieve a certain percentage of improvement
while spending as little money as possible.

In our system, there are four kinds of hardware we can increase:
the total amount of memory M, the number of processors p, the number of memory
modules m, and the number of I/O devices r. We will investigate the effect
of each individual parameter. So, when we change a certain parameter,
we will assume the other three are fixed. At the end of this section, we will
also show what will happen when we increase them simultaneously.

The effect of increasing the total amount of memory M, keeping p,
m, and r fixed, is actually the same as that of decreasing the size of
each job. This is because both actions have the same effect of allowing more
jobs to enter the memory and get executed at the same time. Obviously,
doubling the memory M is equivalent to halving the size of each job.
Therefore, changing M should yield the same result as that shown in
Figure 30, except that, if we plot Ta against increasing M, the curve will
be exponentially decreasing instead of increasing. Our simulation indeed
shows exactly the same numerical result, and so we omit the repetition.

The  effect  of  increasing  the  number  of  processors  p  has  actually 
been  shown  in  Figures  18  and  20.  In  Figure  32,  we  repeat  this  for  three 
different  m  values.  What  we  want  to  do  here  is  to  compare  the  effects  of 
increasing  p  and  m.  In  Figure  33,  we  use  the  same  values  of  Ta  and  plot 
the  curves  against  m. 

From  Figure  32,  we  can  see  how  important  it  is  to  have  enough 
processors.  In  this  case,  at  least  five  processors  should  be  used  in  order 
to  get  good  performance.  We  have  stressed  the  same  point  earlier  in  this 
chapter.   In  Table  14,  we  show  both  the  memory  utilization  and  the  average 
memory  bandwidth  for  each  job.  We  can  see  in  Table  14-b  the  job  memory 
bandwidth,  i.e.,  the  average  memory  bandwidth  each  job  gets,  does  not  change 
when  we  increase  p.  This  is  what  we  should  expect  since  we  are  dealing  with 
a  monoprogrammed  system.  So,  the  performance  improvement  must  come  from 
the  increase  of  memory  utilization  alone.  In  Table  14-a,  we  can  see  the 

Figure 32. The Effect of p on the Average Turnaround Time Ta

Figure 33. The Effect of m on the Average Turnaround Time Ta

           p \ m        16             24             32

             4      41.3, 14.1     41.3,  9.5     41.4,  7.0
             5      51.5, 17.6     50.0, 11.5     49.1,  8.3
             6      53.0, 18.0     51.8, 11.8     51.1,  8.7
             7      53.4, 18.4     52.4, 12.0     51.6,  8.7
             8      54.2, 18.5     53.1, 12.2     52.2,  8.9

(a) Percentages of Memory Utilization and Memory
Waste of the Partitioned Scheme


           p \ m        16        24        32

             4         1.35      1.45      1.52
             5         1.36      1.46      1.52
             6         1.36      1.46      1.52
             7         1.36      1.46      1.52
             8         1.36      1.46      1.52

(b) Job Memory Bandwidth


Table 14. The Memory Utilization, Memory Waste, and Job Memory
Bandwidth for the System of Figures 32 and 33.

Entries are (B, q).

           p \ m        16             24             32

             4      2.69, 852      2.78, 708      2.84, 629
             5      3.27, 114      3.28,  74      3.29,  67
             6      3.30,  41      3.31,  30      3.31,  29
             7      3.31,  25      3.32,  17      3.33,  16
             8      3.32,  23      3.33,  15      3.33,  14

Table 15. The Total Memory Bandwidth (B) and Average
Queueing Time (q) for the System of Figures
32 and 33.




memory utilization increase by as much as 13% as we double the processors
from 4 to 8. When p=4, the memory is indeed under-utilized. Of course,
this is because we only allow up to four jobs in the memory, so a lot of
jobs have to wait in the outside queue. We can see from Table 15 that the
queueing time for p=4 is very large. When we increase p, we actually allow
more jobs to be in the memory at the same time, and hence the memory
utilization goes up and the queueing time goes down. Meanwhile, the average
service time increases due to the competition for I/O devices. For large p,
say 7 or 8, the curves become flat, and no improvement will result even if
we add more processors. From our simulation results, we see that most of the
time the memory can only contain six to seven jobs. So, any processors
beyond that will simply be wasted.

From Figure 33, we can see the turnaround time also goes down when
we increase m. In fact, the increase of m has a twofold effect on the system
performance. It can reduce the memory waste since the module size will be
reduced. (Notice that, when we increase m, we are holding the total amount
of memory fixed.) This can be seen in Table 14-a. Also, it can increase the
memory bandwidth since the degree of interleaving for each job will be
increased. This can be seen in Tables 14-b and 15. Since the speed ratio s
is only 2, the increase of m will only cause small improvements in the total
memory bandwidth and the job memory bandwidth.

When we increase m from 16 to 24, a significant improvement is
achieved. This is because the memory waste has been reduced significantly,
and the job memory bandwidth has been increased by a non-trivial percentage
(about 8% in this case). When we again increase m from 24 to 32, only a
small change results in the turnaround time, except when p is small.
Therefore, an m value of 24 should be enough to achieve good performance.
This phenomenon can also be seen in Figures 19 and 22.

In Table 14-a, we can see the memory utilization actually decrease
when we increase m. This is because the throughput has been increased, and
a job will stay in the memory for a shorter period of time. We explained
this when we discussed Table 7.

From Figures 32 and 33, we can see that the number of processors
has the most profound effect on the system performance. If we do not have
enough processing power, the increase of any other hardware will turn out
to be wasteful.

Figure 34 shows the effect of another parameter r, the number of
I/O devices. Here we use 8 processors, 24 memory modules, and the partitioned
scheme. As we can see, the breaking points occur at r=4. When we increase
the number of I/O devices from 2 to 4, the average turnaround time improves
significantly. This can be explained by Table 16. Table 16 shows the average
queueing time of a job waiting for an I/O device. For r=2, the queueing time
is very large. Apparently, many jobs jam up in the I/O queue due to lack of
I/O channels. When we double the number of I/O devices to 4, the queueing
time drops drastically. This can be seen in the first two rows in Table 16.
The reason is rather simple: since we are using monoprogramming and 8
processors, at most 8 jobs will be in the memory simultaneously, and since
the jobs are not particularly I/O-bound, the probability that more than four
jobs are doing I/O is small. Hence, four I/O channels will be enough in this
case. Further increase of I/O channels gives very little improvement to the
performance. In fact, this is also true for the multiprogramming case, except
the multiprogramming case shows a little bit higher queueing time since on

                         Unit I/O Time (ms)
             r        21       28       35       42

             2       10.2     23.3     40.5     53.6
             4        0.9      1.9      3.6      6.6
             6        0.9      1.8      3.5      6.3
             8        0.9      1.8      3.5      6.3

Table 16. The Queueing Time for I/O Device.

the average more jobs will be competing for the I/O channels. Of course,
if we are dealing with an I/O-bound job mix, more I/O devices will be needed,
since the I/O stage will become the bottleneck of the system.

From Figure 34, we can also see that the percent difference of
the turnaround time between r=2 and r=4 decreases as we reduce the average
time per I/O operation, i.e., the curve becomes flatter for faster I/O
devices. This is because the average unit I/O time has a twofold effect
on the turnaround time. When we reduce the average unit I/O time, both the
job I/O time and the queueing time for I/O are reduced at the same time.
In particular, the queueing time drops by a rather large factor when the
average unit I/O time is reduced. This can be seen in each row of Table 16.

One more interesting thing is that the queueing time for r=2 with an
average unit I/O time of 21 ms is almost 60% larger than that for r=4 with
an average unit I/O time of 42 ms (Table 16). However, the average turnaround
time for the former case is about 25% better than the average turnaround time
for the latter (Figure 34). This can be explained as follows.
Although both cases have the "same" power, with four slow I/O devices a
job will not suffer any delay unless there are already four or more jobs
doing I/O. In other words, a job has a lower probability of being enqueued.
But using a faster I/O device reduces the total job I/O time. Apparently,
in these two cases what we gain from the total I/O time is much more than
what we lose in the queueing time. This implies that if the cost of an I/O
device with an average I/O time of 21 ms (e.g., IBM 3330 disk unit) is no
more than twice the cost of an I/O device with an average I/O time of
42 ms (e.g., IBM 2314 disk unit), then it might be a better idea to use half
as many of the faster one. We will look further at the effect of the I/O
speed in the next section.




Now, let us look at how the system performance reacts when we
simultaneously increase the number of processors, memory modules, and I/O
devices. Of course, we would expect the system performance, e.g., the
average turnaround time, to improve at a much larger rate as we increase
them at the same time. But, in order to make a fair comparison of the
"capability" of each system, we will also adjust the workload of each system
by scaling the job arrival time. For example, if we double the size of a
certain system, we will also double the system workload by doubling the job
arrival rate.

Figure 35 shows how the average turnaround time reacts when we
double the system size from (4,12,2) to (8,24,4), then to (16,48,8). Of
course, we double the job arrival rate every time we double the system size.
Notice that in these experiments, when we double the number of modules we
also double the total memory size since we keep the module size the same.
In the (4,12,2) system, we use 512K bytes of main memory. So, we use 1024K
and 2048K bytes in the (8,24,4) and (16,48,8) systems, respectively.

As we can see, the Ta curves (solid curves) drop roughly
exponentially as we double the system size, even though we double the job
arrival rate at the same time. This means that if we double the system size,
we should be able to handle more than twice the workload. In Table 17, we
show the total memory bandwidth, the job memory bandwidth, the average
number of jobs in the system, and the queueing time for three different
system sizes. As we can see, the total memory bandwidth increases very
rapidly when we double the system size. This is the major reason that the
average turnaround time improves so quickly, since the total memory bandwidth
is the amount of work being done in a unit of time. Only the job memory
bandwidth

   Allocation                                    System (p,m,r)
   Scheme        Quantity                    (4,12,2)  (8,24,4)  (16,48,8)

   Distributed   Total memory bandwidth        3.57      6.66      9.97
                 Job memory bandwidth          2.67      2.79      2.88
                 Avg. number of jobs (n)       3.04      5.19      9.35
                 Avg. queueing time (q)        65.8      11.1       1.4

   Mixed         Total memory bandwidth        3.39      6.60      9.91
                 Job memory bandwidth          1.75      1.75      1.76
                 Avg. number of jobs (n)       3.50      6.46     12.16
                 Avg. queueing time (q)       141.6      27.9       4.2

   Partitioned   Total memory bandwidth        3.36      6.57      9.89
                 Job memory bandwidth          1.75      1.75      1.75
                 Avg. number of jobs (n)       3.38      6.40     12.09
                 Avg. queueing time (q)       266.6      37.1       5.8

Table 17. The Total Memory Bandwidth (B), Job Memory Bandwidth
(B_w), Average Number of Jobs in System (n), and Average
Queueing Time (q) for each System Size.



of the distributed system increases when we enlarge the system size. This
is because the degree of interleaving has been doubled, which reduces the
memory interference between jobs. But the important thing is that the average
number of jobs in every system almost doubles every time we double the
system size, which causes the large increase of the total memory bandwidth.
This implies the doubled system has twice the capability of containing jobs
in memory. Of course, this seems rather intuitive since we also double the
memory. However, in a smaller system, it is very possible that a few large
jobs will occupy the memory and block the other jobs for a long time. This
is less likely to happen in a larger system, since it is unlikely that
several very large jobs will compete for the memory at the same time. In
other words, a larger system will have a higher potential of allowing more
jobs to enter the memory. So, a job will pass through the large system
quicker than through the small system. In queueing theory, this is called
the diminishing effect. In Table 17, we can see the queueing time decreases
very fast when we increase the system size.

In fact, we can show that a double-sized system can handle
about 2.8, or roughly $2^{1.5}$, times the workload. This is shown by the
dotted curves in Figure 35. The dotted curves are obtained by the following
method.

Let us fix the performance of the (8,24,4) system and try to
bring the performance of the other two systems close to it by adjusting
the arrival rate. In order to lower the turnaround time of the (4,12,2)
system, we slow down the arrival rate by increasing the arrival scaling
factor. Notice that the larger the arrival scaling factor is, the lower
the arrival rate will be. In Figure 35, we found that if we use an
arrival scaling factor of 0.28 ($= 0.1 \times 2^{1.5}$), the (4,12,2) system
will perform roughly the same as the (8,24,4) system. In other words, we make
the arrival rate, and hence the workload, of the (4,12,2) system 2.8 times
smaller than that of the (8,24,4) system and get almost the same turnaround
time. On the other hand, we use an arrival scaling factor of 0.039
($> 0.1 / 2^{1.5}$) for the (16,48,8) system and the turnaround time
increases a little bit to the neighborhood of that of the (8,24,4) system.

Therefore, all three systems now perform more or less the same,
but the workload ratio is kept roughly at $2^{1.5}$ between two consecutive
systems which have a size ratio of 2. This means that the processing power
of a double-sized system is about $2^{1.5}$ times larger than that of the
original system. If we let c be the size of a system, we may say the
processing power of our system grows roughly according to the function
$c^{1.5}$. Since the cost of a system is directly proportional to the size
of the system, we might as well think of c as the cost. Therefore, the
processing power P of our system can be formulated as follows:

$$P = a\,c^{1.5} = a\,c\sqrt{c},$$

where a is some proportionality constant. What this result implies is that
the performance will grow faster than linearly as we increase the size of
the system.
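
The power law can be evaluated numerically with a short sketch. The
function name and the choice a = 1 are assumptions made for illustration;
the base arrival scaling factor of 0.1 for the (8,24,4) system is the
value implied by the scaling factors quoted above.

```python
def processing_power(size: float, a: float = 1.0, exponent: float = 1.5) -> float:
    """P = a * c**1.5, with c the system size (or cost)."""
    return a * size ** exponent


if __name__ == "__main__":
    # Workload ratio between a system and one twice its size.
    ratio = processing_power(2.0) / processing_power(1.0)
    print(f"power ratio for doubled size: {ratio:.2f}")

    # Arrival scaling factors that would equalize performance under the
    # power law (a larger factor means a lower arrival rate).
    base_factor = 0.1                      # (8,24,4) system
    print(f"(4,12,2):  {base_factor * ratio:.3f}")
    print(f"(16,48,8): {base_factor / ratio:.3f}")
```

The factor computed for the (4,12,2) system rounds to the 0.28 used above.
The factor computed for the (16,48,8) system is about 0.035, slightly
below the 0.039 actually used, so the law is only approximate.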

However, we must point out that the above result only holds in
the range shown in Figure 35. When we double the system size again to
(32,96,16), the workload it can handle to yield a similar turnaround time
does not grow to as much as 2.7 or 2.8 times the workload of the
(16,48,8) system. In fact, the (32,96,16) system can only handle a
workload of about 2.3 times that of the (16,48,8) system. Obviously, arrival
rate has a larger effect on the performance than system size.

So, the $c^{1.5}$ relationship does not hold for systems beyond (16,48,8).
However, for a general system, (16,48,8) is already a reasonably large size.
We can say that the above result is good for the range of system sizes most
people will be interested in.

3.2.2  Hardware  Speed  Effect 

In this section, we are going to study the effect of using faster
components. There are two parameters we will look at, namely, the average
unit I/O time and the memory-processor speed ratio. We did mention some
effects of these two parameters earlier; however, we will now look at this
problem from a slightly different angle.

Let  us  first  look  at  how  the  memory-processor  speed  ratio  s  will 
affect  the  system  performance.  For  the  convenience  of  this  discussion,  we 
will  assume  the  processor  speed  to  be  fixed.  So,  the  larger  the  s  is,  the 
slower  the  memory  will  be. 

Figure 36 shows the Ta versus s curves for three different memory
allocation schemes. As we can see, all three curves go up slightly more than
linearly as we slow down the memory speed. However, the slope of the curve
for the distributed system is smaller than those of the other two systems.
This means that the memory speed has less effect on the distributed system.
We can explain this by using the bandwidth equation and the following example.

Assume we have eight memory modules, and three jobs in the memory
which require 1, 3, and 4 modules respectively. Let us compare the
partitioned system and the distributed system. If we use the random
distribution assumption, we can apply Ravi's equation [38] to compute the
memory

   System                                      s=1    s=2    s=3    s=4    s=5

   Distributed   Total bandwidth               2.64   4.41   5.60   6.39   6.92
                 Words per processor cycle     0.88   0.73   0.62   0.53   0.46

   Partitioned   Total bandwidth               3.00   4.42   5.42   6.14   6.65
                 Words per processor cycle     1.00   0.73   0.60   0.51   0.44

Table 18. The Total Memory Bandwidths (Accesses per
Memory Cycle) and the Number of Words a
Processor can get per Processor Cycle for
5 Different s Values.
(3 Jobs in 8 Memory Modules)

bandwidth. For the distributed system, the bandwidth will be
$8[1-(1-1/8)^{3s}]$, and for the partitioned system, it will be
$1 + 3[1-(1-1/3)^{s}] + 4[1-(1-1/4)^{s}]$. We list their numerical values in
Table 18. Notice that,
although these values increase as s gets larger, the number of words a
processor can fetch from the memory in a certain unit of time, say one
microsecond or one processor cycle, is actually reduced. In Table 18, we also
list the average number of words each processor can get in one processor
cycle. This is done by dividing the bandwidth by 3s, since the memory is
s times slower than the processor and there are three jobs in the memory.
We can see this "normalized bandwidth" is indeed decreasing when we increase
s. This is because the memory cycle time is doubled when we double the
speed ratio s; however, the values we get by using Ravi's equation do not
double at the same time. Therefore, it in fact takes longer to fetch the
same number of words out of the memory if s becomes larger. This is why
the average turnaround time increases when we increase s. Moreover, as we
can see from Table 18, the normalized bandwidth (number of words per
processor cycle) of the partitioned system decreases faster than that of the
distributed system. Thus, the average turnaround time of the partitioned
system degrades faster than that of the distributed system. For the mixed
system, the situation is very similar to the partitioned system except the
turnaround time is a little bit better. This is, of course, because the
mixed system has a rather similar way of storing the jobs but with a little
bit better memory space utilization.
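
The bandwidth figures of Table 18 can be recomputed with a short sketch.
It assumes the usual random-access form m[1 - (1 - 1/m)^n] for m modules
receiving n requests per memory cycle, with n = 3s for the distributed
system (three jobs sharing all eight modules) and n = s within each
partition; these exponents are inferred from the tabulated values, and the
function names are hypothetical.

```python
def bandwidth(modules: int, requests: float) -> float:
    """Expected busy modules per memory cycle: m * (1 - (1 - 1/m)**n)."""
    return modules * (1.0 - (1.0 - 1.0 / modules) ** requests)


def distributed(s: int, total_modules: int = 8, jobs: int = 3) -> float:
    """All jobs spread over all modules: jobs * s requests per memory cycle."""
    return bandwidth(total_modules, jobs * s)


def partitioned(s: int, partitions=(1, 3, 4)) -> float:
    """Each job confined to its own modules: s requests per partition."""
    return sum(bandwidth(k, s) for k in partitions)


if __name__ == "__main__":
    jobs = 3
    for s in range(1, 6):
        d, p = distributed(s), partitioned(s)
        # Normalize by jobs * s to get words per processor per processor cycle.
        print(f"s={s}  distributed {d:.2f} ({d / (jobs * s):.2f})"
              f"  partitioned {p:.2f} ({p / (jobs * s):.2f})")
```

Under these assumptions the computed values agree with Table 18 to within
0.01, and the per-processor figures fall faster for the partitioned
system as s grows.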

However,  we  can  also  look  at  Figure  36  from  the  opposite  angle. 
When  we  increase  the  memory  speed,  or  decrease  s,  the  average  turnaround 
time  of  the  partitioned  system  improves  faster  than  that  of  the  other  two 
systems. For example, in this case the Ta of the partitioned system drops
from 146 down to 83 as we reduce s from 5 to 2. That is a reduction of
43%. For the mixed system and the distributed system, the reductions are
39% and 12%, respectively. Therefore, the memory speed is a very important
factor when we are using the partitioned scheme. After all, the partitioned
system might yield a better bandwidth when the speed ratio is very small.
We can see this, for example, in the s=1 column of Table 18.

Recently, memory technology has provided system designers with
faster, cheaper, and higher density semiconductor memories. It is now
economically feasible to design a system which operates in the small speed
ratio region. This makes the partitioned scheme more attractive to use,
since it provides very high reliability as well as competitive performance.

The  second  speed  parameter  we  are  going  to  look  at  is  the  average 
period  of  time  spent  on  an  I/O  operation,  or  what  we  call  the  average  unit 
I/O  time.  We  have  shown  some  effect  of  the  average  unit  I/O  time  in  the 
last  section.  Now,  let  us  look  at  how  this  parameter  affects  the  system 
performance. 

Figure  37  shows  the  Ta  curves  versus  the  average  unit  I/O  time 
for  all  three  memory  allocation  schemes.  All  the  curves  have  roughly  the 
same  increasing  rate  as  we  increase  the  average  unit  I/O  time.  This  rate 
is  larger  than  linear.  We  have  explained  that  the  reason  is  that  this 
average  unit  I/O  time  has  a  two  fold  effect  on  the  average  turnaround  time 
since  it  not  only  increases  the  total  I/O  time  but  also  increases  the 
queueing  time  indirectly.  So,  as  we  use  slower  I/O  devices,  the  average 
turnaround  time  degrades  rather  quickly. 

Figure 37. The Effect of the Average Unit I/O Time.

One figure we should pay some attention to: when we halve the
unit I/O time from 42 ms to 21 ms, the average turnaround times all decrease
by more than 40%. This is a larger effect than that of halving the memory
cycle time. Therefore, using a faster I/O device will bring a more
significant improvement in performance, at least given the I/O-execution
time balance of our job load. Needless to say, the effect will be even
larger if we are dealing with an I/O-bound job mix.

Of course, the type and number of I/O devices
will depend on the cost and the resulting performance. For example, in
Figure 34, we can see that using two 21 ms I/O devices can yield a better
turnaround time than using four 42 ms I/O devices, and if the faster I/O
device does not cost more than twice as much as the slower I/O device,
it is obvious that the faster I/O device will be a better choice. However,
it might be extremely expensive to replace a slower I/O device by a faster
I/O device, since it may involve the replacement of the I/O controller and
some very expensive equipment. So, it is very important to understand the
effect of the I/O device before we can decide what to use in a system.

3.2.3  Partial  Connection 

In  all  the  discussions  we  have  up  to  this  point,  we  assume  our 
system  to  have  a  fully  connected  switching  network  (full  connection  in 
short),  e.g.,  a  crossbar  network.  As  we  described  earlier,  in  this  kind 
of  system  a  processor  is  physically  connected  to  all  the  memory  modules 
and  can  access  any  module  if  it  is  allowed  to  do  so.  So,  the  operating 
system  can  freely  assign  any  module  to  any  processor  as  long  as  the  resource 
management  policy  is  not  violated.  However,  the  cost  of  a  full  connection 
will  grow  very   quickly  as  we  expand  the  size  of  the  system.  This  is  the 
price we shall have to pay in order to maintain that availability.

Now,  let  us  look  at  another  kind  of  architecture  which  is  a  cheaper 
and  more  flexible  alternative  for  interconnecting  the  processor  and  the 
memories,  namely,  the  partial  connection  network.  The  best  example  of  a 
partial  connection  is  the  multiport  memory  network  used  by  the  PRIME  system, 
which  we  showed  in  Figure  1.  We  briefly  talked  about  the  advantages  and  the 
disadvantages  of  a  partial  connection  network  a  few  times  in  the  early 
chapters.  In  this  section,  we  are  going  to  elaborate  more  about  this  sub- 
ject. 

Of course, the biggest (perhaps the only) disadvantage of the
partial connection is performance degradation. The performance degrades
when we reduce the connections between the processors and the memory modules.
The main reason is that the utilization of the available memory is
seriously restricted by the partial connection. Very often, a job cannot
be put into the memory because no processor has enough free space connected
to it, although the total unused space is larger than what this job is
requesting. This can be explained by a simple diagram. Figure 38 shows a
partially  connected  system  with  three  processors  and  six  two-port  memory 
modules  which  are  interconnected  in  a  uniform  way  similar  to  that  in  the 
PRIME  system,  and  two  jobs  a  and  b  are   occupying  four  modules  as  shown  in 
the  figure.  Suppose  a  third  job  arrives  which  requires  two  modules.  It 
cannot  enter  the  memory  since  the  third  processor  has  only  one  unoccupied 
module  attached  to  it,  although  there  are  two  free  modules  available  in  the 
system.  Obviously,  the  memory  will  be  wasted  due  to  this  incomplete  inter- 
connection. As  a  result,  the  system  performance  is  degraded. 

Figure 38. A Partial Connection Network

The memory waste caused by the partial connection is quite different
from that caused by the monoprogramming scheme. The
memory waste here is created by the inaccessibility of a memory module
to a processor. For example, in Figure 38, the fourth module is wasted since
processor 3 does not connect to this module. Therefore, if the processors
that connect to a certain module are all assigned jobs, then the unused
portion of this module, probably the whole module, is wasted.

The  memory  waste  highly  depends  on  the  number  of  processors  at- 
tached to  a  module.  If  we  reduce  the  number  of  processors  that  can  access 
a  module,  the  probability  that  part  or  all  of  this  module  will  be  wasted  is 
increased.  Of  course,  when  the  memory  waste  increases  the  system  performance 
gets  worse. 

In  our  discussion  of  the  partial  connection,  we  will  always  assume 
all  the  ports  of  a  module  are  connected  to  processors  and  none  is  left  un- 
used. So,  the  number  of  processors  connected  to  a  memory  module  is  equiva- 
lent to  the  number  of  ports  the  module  has.  We  will  also  assume  that  all 
the  modules  are  identical  with  a  number  of  ports  which  is  no  more  than  the 
total  number  of  processors  in  the  system.  Since  in  a  partially  connected 
system  all  the  processors  might  not  connect  to  the  same  number  of  memory 
modules,  we  will  use  an  array  of  integers  to  represent  this  system  which  can 
tell  us  how  many  modules  a  certain  processor  is  connected  to.  For  example, 
we  will  represent  the  system  in  Figure  38  by  (4,4,4).  In  fact,  the  order 
of  these  integers  is  not  important  since  the  processor  number  is  arbitrarily 
assigned. Notice that the sum of these integers is equal to the total
number of ports in the memory. Of course, these numbers do not reveal
how the connections are made. If the details of the connection are
important to the discussion, we will use a representation called the
"connection matrix" to describe the connection. But in general, the
connection will be as uniform as that in Figure 38.

The connection matrix is a succinct representation of a partial
connection. The matrix has p rows and m columns with each entry indicating
the connection between a processor and a memory module. If a connection
is made between processor i and module j, we will put a 1 at the
position (i,j); otherwise, the entry will be 0. For example, the partial
connection in Figure 38 can be represented by the following 3 by 6
connection matrix:


1 1 1 1 0 0
0 0 1 1 1 1
1 1 0 0 1 1


Notice  that  all  the  column  sums  are  equal  to  the  number  of  ports  of  a  module, 
and  each  row  sum  is  equal  to  the  number  of  modules  connected  to  the  cor- 
responding processor. 
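
The row-sum and column-sum properties can be checked mechanically. The short Python sketch below stores the Figure 38 matrix as a list of rows; the names FIGURE_38, row_sums, and column_sums are illustrative and do not come from the original text.

```python
# Connection matrix of the three-processor, six-module system of Figure 38.
# Rows are processors, columns are memory modules; 1 marks a connection.
FIGURE_38 = [
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 1, 1],
]


def row_sums(matrix):
    """Number of modules connected to each processor."""
    return [sum(row) for row in matrix]


def column_sums(matrix):
    """Number of processors (ports in use) attached to each module."""
    return [sum(column) for column in zip(*matrix)]


if __name__ == "__main__":
    print(tuple(row_sums(FIGURE_38)))   # the (4,4,4) representation
    print(column_sums(FIGURE_38))       # two ports per module
```

The row sums give the integer-array representation of the system, and their total equals the total number of ports, as noted above.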

Figure  39  shows  how  the  average  turnaround  time  degrades  when  we 
reduce  the  number  of  ports  of  each  memory  module,  or  equivalently  the  number 
of  processors  connected  to  a  memory  module.  When  the  number  of  ports  per 
module  is  8,  it  is  equivalent  to  the  full  connection  since  we  are  using  8 
processors.  If  we  reduce  the  number  of  ports  to  4,  only  half  of  the  proces- 
sors will  connect  to  every  module.  Here,  we  use  a  (12,12,12,12,12,12,12,12) 
connection.  The  processors  are  connected  to  the  memory  modules  in  a  uni- 
form way.  Each  processor  is  connected  to  12  consecutive  modules  with  the 
leading  module  being  skewed  three  modules  to  the  right  of  the  previous 
leading  module.  Using  the  connection  matrix  representation,  this  connection 
can  be  expressed  by: 

111111111111000000000000
000111111111111000000000
000000111111111111000000
000000000111111111111000
000000000000111111111111
111000000000000111111111
111111000000000000111111
111111111000000000000111

Figure 39. The Performance Degradation of the Partial Connection System.


For three-port modules, we have 72 ports altogether. If we use
a similar connection, every processor will have 9 modules. However, in
our job mix we allow a job to claim up to 500K bytes of memory. So, at least
one processor should have 12 modules, i.e., half of the total memory;
otherwise no processor can handle those big jobs. Consequently, not all
processors will have the same number of modules. In Figure 39, we use a
(8,8,8,8,8,8,12,12) connection when the number of ports is 3. The first
six processors are connected to the memory in a way similar to the
4-port memory case, except that each leading module is skewed by four modules.
The last two processors are connected to the first 12 modules and the second
12 modules, respectively. Again, the connection can be expressed by the
following connection matrix:

111111110000000000000000 
000011111111000000000000 
000000001111111100000000 
000000000000111111110000 
000000000000000011111111 
111100000000000000001111 
111111111111000000000000 
000000000000111111111111 
Similarly, we use a (4,4,4,4,4,4,12,12) connection when the
number of ports is 2. This network is connected in the same way as the
last one except now the first six processors are occupying six disjoint
groups of modules. Here is the connection matrix for this connection:

111100000000000000000000
000011110000000000000000
000000001111000000000000
000000000000111100000000
000000000000000011110000
000000000000000000001111
111111111111000000000000
000000000000111111111111
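
The three matrices above follow one construction rule: a window of consecutive modules per processor, with each window's leading module skewed from the previous one. A minimal Python sketch of that rule, assuming 24 modules and 0-based module indices, is given below; the helper names window_row and skewed_connection are hypothetical.

```python
def window_row(start: int, width: int, modules: int = 24):
    """One matrix row: `width` consecutive modules from `start`, wrapping around."""
    row = [0] * modules
    for offset in range(width):
        row[(start + offset) % modules] = 1
    return row


def skewed_connection(small: int, width: int, skew: int,
                      large_starts=(), modules: int = 24):
    """`small` skewed windows of `width` modules plus 12-module "large" rows."""
    rows = [window_row(i * skew, width, modules) for i in range(small)]
    rows += [window_row(start, 12, modules) for start in large_starts]
    return rows


FOUR_PORT = skewed_connection(small=8, width=12, skew=3)
THREE_PORT = skewed_connection(small=6, width=8, skew=4, large_starts=(0, 12))
TWO_PORT = skewed_connection(small=6, width=4, skew=4, large_starts=(0, 12))

if __name__ == "__main__":
    for name, matrix in (("4-port", FOUR_PORT), ("3-port", THREE_PORT),
                         ("2-port", TWO_PORT)):
        print(name, tuple(sum(row) for row in matrix))
        for row in matrix:
            print("".join(map(str, row)))
```

In each case every column sum equals the number of ports per module, which is a quick consistency check on a transcribed matrix.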

All  three  of  these  partial  connections  are  chosen  simply  because 
they  are  very  symmetric.  Of  course,  there  are  many  other  possible  connec- 
tions in  each  case,  and  we  will  look  at  some  of  them  later. 

In  Figure  39,  we  show  the  turnaround  time  curves  for  all  six 
combinations.  As  we  can  see,  all  six  curves  deteriorate  when  we  reduce  the 
number  of  ports  of  each  module.  However,  many  interesting  and  important 
results  are  shown  in  this  figure  which  we  are  going  to  point  out  here. 

Perhaps  the  most  noticeable  result  is  what  happens  to  the  two 
groups  of  curves,  namely,  the  monoprogramming  curves  (the  solid  ones)  and 
the  multiprogramming  curves  (the  dotted  ones).  When  we  use  full  connection 
(number  of  ports  equal  to  8),  the  multiprogramming  results  are  all  better 
than  their  corresponding  monoprogramming  results.  But  as  we  reduce  the  number 
of  ports,  the  multiprogramming  curves  get  worse  rather  quickly.  For  example, 
when  we  reduce  the  number  of  ports  to  2,  the  distributed  system  curve  in- 
creases by  about  75%  which  is  the  worst  among  all  six  curves.  The  mixed 
system  and  partitioned  system  curves  increase  by  50%  and  53%,  respectively. 
In the meantime, the monoprogramming curves increase by relatively small
percentages,  39%  for  the  distributed  system  curve,  21%  for  the  mixed  system 
curve,  and  only  10.6%  for  the  partitioned  system  curve.  When  the  number  of 
ports  is  reduced  to  2,  all  the  monoprogramming  results  are  well  below  the 
multiprogramming  results.  The  monoprogramming  curves  win  by  a  margin  of 
roughly  30%.  Obviously,  multiprogramming  is  more  sensitive  to  the  connec- 
tion. 

The  degradation  of  a  monoprogramming  curve  can  be  explained  by 
the  memory  waste  caused  by  the  reduction  of  connections.  Table  19  shows  the 
word  memory  utilizations  of  all  the  systems  in  Figure  39.  The  difference 
between  the  utilizations  of  a  partially  connected  monoprogrammed  system 
and  a  fully  connected  monoprogrammed  system  is,  of  course,  the  percentage 
of  memory  being  wasted  due  to  the  reduction  of  connections.  As  we  can  see, 
in  all  the  monoprogramming  rows,  the  memory  utilization  is  strictly  de- 
creasing, which  means  the  memory  waste  is  increasing.  However,  the  memory 
waste is rather small. This is why the degradations of the monoprogrammed
systems are small. Apparently, the reduction of connections has only a
small effect on the memory utilization of a monoprogrammed system.

Table 19, however, does not tell us the memory waste of a
multiprogrammed system since all the memory utilizations for multiprogrammed
systems are increasing. This does not mean that no memory is wasted in a
multiprogrammed system. Some memory, although it might be small, must be wasted
when we reduce the number of ports. Since a module can only be used by some
of the processors, it is very likely that this module will not be used up when
the processors attached to it are all occupied. Therefore, some other
factor is causing the increase in the memory utilization in a partial
connection multiprogrammed system.

                          Number of Ports per Module
System                   8 (Full)     4       3       2

Partitioned    Mono.       52.5     52.0    51.5    51.1
               Multi.      52.0     53.5    54.2    64.7
Mixed          Mono.       53.4     52.7    52.6    51.6
               Multi.      53.8     56.2    57.0    68.1
Distributed    Mono.       51.5     51.0    50.4    50.2
               Multi.      52.2     53.5    58.1    66.5

Table 19. The Word Memory Utilization of All the Systems in Figure 39.

In fact, it is not difficult to see. The memory utilization goes
up because every job is now using the memory for a longer time. We have
explained a similar phenomenon in Section 3.1.3 when we discussed Table 7.
Let us show the average service time, i.e., the time a job spends in the
memory, of each multiprogrammed case in Table 20. As we can see, the service
time t_s increases rather rapidly as we reduce the number of ports. The longer
residence of each job in the memory thus results in a higher memory
utilization. This is why the memory utilization is increasing instead of
decreasing. Apparently, the memory waste due to the partial connection has
been masked by this increase. Therefore, we cannot use memory waste to
fully explain the degradation of the average turnaround time of a
multiprogrammed system.

Interestingly,  when  we  look  at  the  statistics  gathered  from  simu- 
lation outputs,  we  find  that  the  queueing  time  a  job  spends  in  waiting  for  a 
processor  increases  almost  at  the  same  pace  as  the  average  service  time 
increases.  In  other  words,  the  increase  of  the  service  time  comes  from  the 
increase of the queueing time waiting for a processor. In Table 20, we
show this queueing time q_p together with the average service time t_s.

From Table 20, we can see that the queueing time for a processor can
be as large as 23% of the total service time (19.1/82.2 for the distributed
system). Obviously, this queueing increase is a big factor in the
performance degradation of a multiprogrammed system. However, there is no
queueing time for a processor in a monoprogrammed system, since each job
has its own dedicated processor. The degradation of a monoprogrammed system
simply comes from the memory waste. This explains why the monoprogramming
curves deteriorate more slowly than their multiprogramming counterparts. The
serious queueing delay when the number of ports is small, especially when it
is 2, is the major reason why the multiprogramming curves are significantly
higher than the monoprogramming curves.

                        Number of Ports per Module
System                 8 (Full)     4       3       2

Partitioned    t_s       68.4     69.6    70.5    81.2
               q_p        0.3      1.6     2.6    13.6
Mixed          t_s       70.6     73.3    74.7    87.7
               q_p        0.4      3.2     4.8    18.1
Distributed    t_s       63.9     70.3    74.5    82.2
               q_p        0.8      7.2    11.4    19.1

Table 20. The Average Service Time (t_s) and Queueing Time for Processors
(q_p) of the Multiprogrammed Systems
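
The two observations drawn from Table 20 (the queueing share of the service time, and the matching growth of both quantities) are simple arithmetic on the table entries. The Python sketch below transcribes the table values; the dictionary layout and function names are illustrative choices, not part of the original simulation.

```python
# Values transcribed from Table 20, keyed by ports per module:
# (average service time, queueing time for a processor).
TABLE_20 = {
    "partitioned": {8: (68.4, 0.3), 4: (69.6, 1.6), 3: (70.5, 2.6), 2: (81.2, 13.6)},
    "mixed":       {8: (70.6, 0.4), 4: (73.3, 3.2), 3: (74.7, 4.8), 2: (87.7, 18.1)},
    "distributed": {8: (63.9, 0.8), 4: (70.3, 7.2), 3: (74.5, 11.4), 2: (82.2, 19.1)},
}


def queueing_share(system: str, ports: int) -> float:
    """Processor queueing time as a percentage of the service time."""
    service, queueing = TABLE_20[system][ports]
    return 100.0 * queueing / service


def growth_from_full(system: str, ports: int):
    """Increase of (service time, queueing time) relative to full connection."""
    full_service, full_queueing = TABLE_20[system][8]
    service, queueing = TABLE_20[system][ports]
    return round(service - full_service, 1), round(queueing - full_queueing, 1)


if __name__ == "__main__":
    for system in TABLE_20:
        print(system, f"{queueing_share(system, 2):.1f}%", growth_from_full(system, 2))
```

For every scheme the growth of the service time between 8 ports and 2 ports is within about one time unit of the growth of the queueing time, which supports the statement that the extra service time is queueing for a processor.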

Now, let us explain the cause of this queueing delay. Since in
a multiprogrammed system there might be more jobs in the memory than the
number of processors, competition for a free processor is bound
to happen. If all the processors are busy when a job comes
in or returns from I/O, this job will certainly have to wait until some
processor becomes free. The situation in a full connection system is simple
since a job can be executed by any processor. So, all the pending jobs will
wait in a single queue and get served on a first-come-first-served basis.
In a full connection case, the probability that all processors are executing
jobs is rather small, since in the first place the probability that there are
more than eight jobs in the memory is not too large, and secondly, a job will
spend a significant amount of time doing I/O. Therefore, the queueing
time for a free processor will be small in this case. As we can see in the
first column of Table 20, this is indeed the case. Moreover, since the
partitioned system allows the smallest number of jobs in the memory, its
queueing time is the smallest among the three allocation schemes.

However, the situation is much more complicated in a partial
connection network. A job cannot be executed by every processor since not
every processor can access all the memory modules this job occupies. In
fact, the number of processors that can execute a job is bounded by the
number of ports of a memory module. Very often, a job can only be executed
by one particular processor! This is especially true when the number of
ports is small. Let us look at an example in Figure 40, where we show a
portion of a partial connection, multiprogrammed system. As we can see,
job c can be executed by both processors 3 and 5, but jobs a and b can
only be executed by processors 5 and 4, respectively. So, for example,
job a cannot be executed if processor 5 is executing another job. In other
words, a job can only queue for the few processors that can execute it in
a partial connection, multiprogrammed system. This certainly will cause a
serious queueing delay. If we keep reducing the number of connections, this
situation will get worse and worse. This is why the queueing time for a
processor grows so fast when the number of ports is decreased, which in turn
degrades the turnaround time.
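
The restriction on which processors may execute a job can be expressed directly with a connection matrix. Since the details of Figure 40 are not reproduced in the text, the Python sketch below uses the two-port (4,4,4,4,4,4,12,12) matrix given earlier instead; the 1-based numbering of processors and modules and the function name eligible_processors are assumptions for illustration.

```python
# Two-port (4,4,4,4,4,4,12,12) connection matrix; row p is processor p,
# character j of a row is module j (both counted from 1 below).
TWO_PORT_ROWS = [
    "111100000000000000000000",
    "000011110000000000000000",
    "000000001111000000000000",
    "000000000000111100000000",
    "000000000000000011110000",
    "000000000000000000001111",
    "111111111111000000000000",
    "000000000000111111111111",
]


def eligible_processors(job_modules, rows=TWO_PORT_ROWS):
    """Processors connected to every module the job occupies."""
    return [p for p, row in enumerate(rows, start=1)
            if all(row[m - 1] == "1" for m in job_modules)]


if __name__ == "__main__":
    print(eligible_processors({1, 2}))      # fits inside one small group
    print(eligible_processors({3, 4, 5}))   # spans two small groups
```

A job confined to one small processor's group can run on two processors, while a job spanning two groups can run only on the large processor that reaches both, so it must wait whenever that processor is busy.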

Therefore, the monoprogramming scheme is superior to the
multiprogramming scheme if we are using partial connection, especially when
the number of ports per module is small. This is the opposite of what happens
in full connection systems.

Now,  let  us  concentrate  on  the  monoprogramming  curves  in  Figure  39. 
In  the  full  connection  case,  we  can  see  the  partitioned  scheme  yields  the 
worst  result  and  the  distributed  scheme  yields  the  best.  We  have  explained 
this in terms of memory bandwidth in the first part of this chapter.
However, the situation starts changing when we use the partial connection. In
the two-port memory connection, it is completely reversed, i.e., the
partitioned scheme shows the best turnaround time, and the distributed scheme
shows  the  worst.  This  phenomenon  again  can  be  explained  in  terms  of  memory 
bandwidth. 

Figure 40. An Example of a Partial Connection, Multiprogrammed System.

Let us reuse the example in Figure 40. In fact, what we show
there is the picture of a distributed system if we assume the jobs are
allocated memory in the order a, b, and c. The biggest difference we
can see is the degree of interleaving each job can have. For example, job a
can only be interleaved in three modules if processor 5 does not connect
to any other module. However, in a full connection, each job can be
interleaved into as many modules as the system has. This reduction of the degree
of interleaving drastically decreases the memory bandwidth of the
distributed system. If, on the other hand, we use the partitioned scheme or the
mixed scheme, each job can also be allocated and interleaved in a similar
number of modules, although in general slightly fewer. That means both
the partitioned scheme and the mixed scheme can have bandwidth
comparable to that of the distributed scheme. This is particularly true when
the number of ports per module is small. But the most important thing
is that in a partitioned system there is no memory conflict between any two jobs.
In the other two schemes, especially the distributed scheme, memory
conflict will occur in the shared modules, which results in the degradation
of the memory bandwidth. This is why the partitioned system shows the best
turnaround time on the left half of the figure.

Therefore, we can come to the following two conclusions about
the partial connection. First, monoprogramming is better than
multiprogramming because there is no queueing for processors in the former case.
Second, the partitioned scheme outperforms the other two schemes because it
has no memory conflict. Interestingly, both of these results completely
reverse the situation in the full connection system. In the first part of this
chapter, we kept emphasizing the advantages of monoprogramming and the
partitioned scheme. However, the slightly better performance of
multiprogramming and the distributed scheme tends to make these advantages look
rather debatable. But there is no doubt that, in a partial connection system,
monoprogramming and the partitioned scheme are better in every respect.

Perhaps the most important result in Figure 39 is the
small degradation of the partitioned, monoprogrammed system. When we
reduce the number of ports from 8 to 2, the curve only increases by 10.6%.
Hence, we save 75% of the cost of the connection network but sacrifice only
10.6% of the performance. This is a tremendous improvement in
cost-effectiveness. Therefore, from a cost-effectiveness point of view, the
partial connection system is a better architecture for system design.

Now,  let  us  discuss,  from  the  memory  utilization  point  of  view, 
why  such  a  low  degradation  can  be  achieved.  Recall  in  Figure  39,  we  used 
a  (4,4,4,4,4,4,12,12)  connection  when  each  module  has  only  two  ports. 
Each  of  the  first  six  processors  connects  to  four  modules,  and  no  two 
processors  connect  to  the  same  module.  This  essentially  partitions  the 
whole  system  into  six  disjoint  subsystems  as  far  as  these  six  processors 
are concerned. We can see this from the connection matrix we showed earlier.
Of course, these processors then can only handle jobs of size less than
or equal to four modules.

Table 21 shows the job size distribution of the job mix we are
using, where each job is counted toward the number of modules it will
require under the partitioned scheme. We can see that 81.3% of the jobs are of
size less than or equal to four modules (170K bytes). So, most of the jobs
can be handled by these six processors. Only the remaining 18.7% of the
jobs, i.e., the large jobs, have to be handled by the other two processors,
where each of these two processors is connected to 12 modules (512K bytes).
We let each of these two "large" processors share memory with three of
the other six "small" processors. This arrangement certainly might cause
some trouble for the large jobs. If a large job is under consideration for
entering the memory but neither of these large processors has enough room,
even though together they have enough free space, we have to delay
this job until a processor has gained enough memory by itself. In other
words, a bad distribution of the small jobs in the memory might block a
large job from entering. For example, if four 4-module jobs have occupied
the first, second, fourth, and fifth processors, respectively, a 5-module
job still cannot enter since neither of the last two processors can allocate a
chunk of five modules to this job. So, a large job will experience more
difficulties than it does in a full connection. This arrangement also
has its own advantage. Since not all of the jobs running on the
small processors will use exactly four modules, the unused space can be
chained together by the large processor to make room for another job. Of
course, most likely a small job will be chosen again, since the space
usually might not be large enough for a large job. So, a small job which
requires four memory modules or less can in fact be assigned to any
processor in the system.

Number of Modules     Density

        1              .165
        2              .146
        3              .326
        4              .176
        5              .105
        6              .010
        7              .031
        8              .019
        9              .009
       10              .003
      ≥11              .010

Module Size = 42 2/3 K Bytes

Table 21. The Job Size Distribution
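
The blocking example in this paragraph can be replayed with a small model of the (4,4,4,4,4,4,12,12) connection. The Python sketch below is a simplification under stated assumptions: modules are indexed from 0, a monoprogrammed processor holding a job cannot accept another, and the name processors_with_room is hypothetical.

```python
# (4,4,4,4,4,4,12,12) connection as sets of 0-based module indices.
SMALL = [set(range(4 * i, 4 * i + 4)) for i in range(6)]
LARGE = [set(range(0, 12)), set(range(12, 24))]
CONNECTION = SMALL + LARGE   # processors 1..8


def processors_with_room(job_size, occupied, busy):
    """Idle processors (1-based) that reach at least `job_size` free modules."""
    return [p for p, reach in enumerate(CONNECTION, start=1)
            if p not in busy and len(reach - occupied) >= job_size]


if __name__ == "__main__":
    # Four 4-module jobs on processors 1, 2, 4 and 5, as in the text.
    busy = {1, 2, 4, 5}
    occupied = set().union(*(CONNECTION[p - 1] for p in busy))
    print(24 - len(occupied))                        # free modules in total
    print(processors_with_room(5, occupied, busy))   # candidates for a 5-module job
    print(processors_with_room(4, occupied, busy))   # candidates for a 4-module job
```

Eight modules remain free, yet no processor reaches five of them, so the 5-module job is blocked, while a 4-module job could still go to either a small or a large processor.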

Since the small jobs constitute an absolute majority of the
job mix, the (4,4,4,4,4,4,12,12) connection should still allow a fairly
good memory utilization for the reason we mentioned above. Table 19
indeed shows that the memory utilization of this connection only degrades
slightly from the utilization of the full connection system. This is
the reason why the turnaround time increases by just a small percentage.
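
The percentages quoted from Table 21 follow from summing the densities. A brief Python sketch of that arithmetic is shown here; the dictionary key 11 stands for the "11 or more modules" bucket, and the names used are illustrative.

```python
MODULE_KBYTES = 42 + 2 / 3   # module size from Table 21

# Densities from Table 21; key 11 covers jobs of 11 or more modules.
DENSITY = {1: .165, 2: .146, 3: .326, 4: .176, 5: .105, 6: .010,
           7: .031, 8: .019, 9: .009, 10: .003, 11: .010}


def fraction_at_most(limit: int) -> float:
    """Fraction of jobs needing no more than `limit` modules."""
    return sum(d for size, d in DENSITY.items() if size <= limit)


if __name__ == "__main__":
    small = fraction_at_most(4)
    print(f"small jobs (<= 4 modules): {small:.1%}")
    print(f"large jobs (> 4 modules):  {1 - small:.1%}")
    print(f"exactly 5 modules:         {DENSITY[5]:.1%}")
    print(f"4 modules = {4 * MODULE_KBYTES:.0f}K bytes, "
          f"12 modules = {12 * MODULE_KBYTES:.0f}K bytes")
```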

From Table 21, we can see that 10.5% of the jobs require
more than four modules but no more than five modules. People might wonder
whether the system would perform better if we assigned a few more modules to
some of the small processors so that they could handle this 10.5% of the jobs
and alleviate the traffic in front of the two large processors. We have
collected the results for several slightly different connections, and we found
that the results are more or less the same as those of the (4,4,4,4,4,4,12,12)
connection, despite the inclusion of some 5-module processors. For example,
Table 22 shows the results for two other connections, (3,3,4,4,5,5,12,12)
and (3,3,3,5,5,5,11,13), which are very similar to the first connection.
Apparently, the memory sharing of a large processor with three 4-module
processors can take care of the 5-module jobs very well. But most
importantly, we gain some confidence that these results are indeed in a
reasonable region.

Of  course,  the  job  size  distribution  is  a  very   important  factor 
to  the  performance  of  a  partial  connection.  As  we  said  earlier,  81.3%  of 
the  jobs  are  of  size  less  than  or  equal  to  four  memory  modules  which  is  the 
major  reason  why  the  (4,4,4,4,4,4,12,12)  connection  can  have  good  performance. 
However, if we increase the size of each job, the performance of this
connection might degrade rather severely since more jobs now have to enqueue
for those two large processors. In Table 23, we show some results of the
monoprogrammed, partitioned system when we increase the job size by 25% and
50%. We can see the turnaround time of the (4,4,4,4,4,4,12,12) connection
indeed degrades very quickly. It increases by 33% when we increase the job
size by 25%. In the last row of that table, we can see that the percentage
of the jobs with sizes less than or equal to four modules is now
reduced to 75.9%, which obviously is the reason for the performance
degradation. If we increase the job size by 50%, we can see the turnaround
time increases drastically to 237, which is 152% higher. This is because
41.4% of the jobs now require more than four modules of memory. Apparently,
the system is saturated under this circumstance.


Allocation                              Connection
Scheme         (4,4,4,4,4,4,12,12)  (3,3,4,4,5,5,12,12)*  (3,3,3,5,5,5,11,13)**

Partitioned             94                   95                    96
Mixed                   97                   90                    94
Distributed            100                  101                   109

Connection Matrices

*   111000000000000000000000
    000111000000000000000000
    000000111100000000000000
    000000000011110000000000
    000000000000001111100000
    000000000000000000011111
    111000111100001111100000
    000111000011110000011111

**  111000000000000000000000
    000111000000000000000000
    000000111000000000000000
    000000000111110000000000
    000000000000001111100000
    000000000000000000011111
    000111111111110000000000
    111000000000001111111111

Table 22.  The Average Turnaround Times for Two Different Connections
(Monoprogramming, with All Other Parameters of Figure 39)


                                   Job Size Scaling
Connection                       1.00      1.25      1.50

(4,4,4,4,4,4,12,12)                94       126       237
(3,3,4,4,5,5,12,12)                95       119       162
(2,2,5,5,5,5,12,12)*              106       128       137

% of Jobs with Sizes
<= 4 Memory Modules              81.3      75.9      58.6

*   Connection Matrix

    110000000000000000000000
    001100000000000000000000
    000011111000000000000000
    000000000111110000000000
    000000000000001111100000
    000000000000000000011111
    001111111111110000000000
    110000000000001111111111

Table 23.  The Effect of Job Size on the Average Turnaround
Times of Three Different Connections.
(Monoprogramming, Partitioned Scheme)
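
As a quick consistency check on the connection matrices listed with Table 22, the following Python sketch counts the modules wired to each processor and the ports used on each module. It is an illustration only, not part of the simulator used in this study. The matrix is transcribed from the (3,3,4,4,5,5,12,12) entry, and the function names are hypothetical.

```python
# Rows are processors, columns are memory modules; "1" marks a connection.
# Transcribed from the (3,3,4,4,5,5,12,12) connection matrix of Table 22.
ROWS = [
    "111000000000000000000000",
    "000111000000000000000000",
    "000000111100000000000000",
    "000000000011110000000000",
    "000000000000001111100000",
    "000000000000000000011111",
    "111000111100001111100000",
    "000111000011110000011111",
]


def modules_per_processor(rows):
    """Number of memory modules each processor is connected to (row sums)."""
    return [row.count("1") for row in rows]


def ports_per_module(rows):
    """Number of processors attached to each memory module (column sums)."""
    width = len(rows[0])
    return [sum(row[j] == "1" for row in rows) for j in range(width)]


if __name__ == "__main__":
    print(modules_per_processor(ROWS))
    print(sorted(set(ports_per_module(ROWS))))
```

The row sums reproduce the connection name, and every column sum is 2, which agrees with the 2-port memory assumed for the partial connection.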

If  we  use  (3,3,4,4,5,5,12,12)  connection,  i.e.,  assign  one  more 
module  to  each  of  the  fifth  and  sixth  processors,  we  can  see  the  situation 
is  much  better  when  we  increase  the  job  size.  It  only  degrades  by  70% 
if  we  increase  the  job  size  by  one-half.  In  fact,  we  can  see  in  Table  23, 
the  (2,2,5,5,5,5,12,12)  connection  gives  us  the  best  result  for  enlarged 
job  size.   It  degrades  just  30%  for  a  50%  increase  of  the  job  size.  In 
other  words,  using  more  5-module  processors  can  result  in  a  smaller 
degradation. 
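
The degradation figures quoted above follow from the entries of Table 23. A minimal sketch of the arithmetic is given below. It assumes the rounded table entries, so the computed values can differ from the quoted percentages by about one point; the dictionary and function names are illustrative only.

```python
# Average turnaround times from Table 23 (monoprogramming, partitioned
# scheme), indexed by the job size scaling factor.
TABLE_23 = {
    "(4,4,4,4,4,4,12,12)": {1.00: 94, 1.25: 126, 1.50: 237},
    "(3,3,4,4,5,5,12,12)": {1.00: 95, 1.25: 119, 1.50: 162},
    "(2,2,5,5,5,5,12,12)": {1.00: 106, 1.25: 128, 1.50: 137},
}


def degradation_percent(times, scale):
    """Percentage increase of the turnaround time relative to scale 1.00."""
    base = times[1.00]
    return 100.0 * (times[scale] - base) / base


if __name__ == "__main__":
    for connection, times in TABLE_23.items():
        print(connection, round(degradation_percent(times, 1.50), 1))
```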

Therefore, how to assign memory modules to processors really depends
on how the job sizes are distributed. It is very difficult to formulate
an equation and solve for an "optimal" partial connection.
The only rule of thumb is to look at the job size distribution and partition
the memory ports so that enough processors have sufficient memory
space to handle most of the jobs. In other words, try to assign the
memory so that no severe bottleneck will be created at any processor. For
example, before we scale the job size, four memory modules will be sufficient
for a processor since 81.3% of the jobs are smaller than or equal
to four memory modules. However, when we increase the job size, we need
to connect more modules to some processors so they can handle larger jobs.
Our results indeed show that this approach is generally correct. Of course,
some of the processors will obtain less memory because the total number of
ports is fixed.
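
The rule of thumb can be made concrete with a small sketch. It assumes, as a simplification of the partitioned, monoprogrammed case, that a processor can accept a job only if it is connected to at least as many modules as the job needs; whether those modules are currently free is ignored. The function name and the list of job sizes are hypothetical.

```python
# Modules connected to each of the eight processors, for the three
# connections compared in Table 23.
CONNECTIONS = [
    (4, 4, 4, 4, 4, 4, 12, 12),
    (3, 3, 4, 4, 5, 5, 12, 12),
    (2, 2, 5, 5, 5, 5, 12, 12),
]


def eligible_processors(job_size, connection):
    """Count processors wired to at least `job_size` memory modules."""
    return sum(1 for modules in connection if modules >= job_size)


if __name__ == "__main__":
    for connection in CONNECTIONS:
        counts = [eligible_processors(size, connection) for size in (4, 5, 6)]
        print(connection, counts)
```

For 5-module jobs the number of eligible processors rises from 2 to 4 to 6 across the three connections, which is one way to see why the connections with more 5-module processors degrade less when the jobs grow.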

For more than two ports per memory, the same kind of approach
can also be used, except each processor can then connect to more memory
modules. The memory utilization will be better, and hence the performance
will be improved.

One more interesting thing about the result shown in Table 23 is
that the performance of a partial connection system is very sensitive to
the increase in job size. For example, for a 50% increase, the average
turnaround times of these three connections increase by 152%, 70%, and
30%, respectively. If we refer back to Table 12, we can see the turnaround
time of a full connection system only degrades 15% when we increase the
job size by 50%, which is significantly lower. So, when we are planning
to use a partial connection network, we ought to be very careful about
the job size distribution and use enough memory in order to achieve a
satisfactory level of performance.

Finally, let us redo Figure 35 for the partial connection system,
i.e., find out the effect of system size on the system performance.
Figure 41 shows how the average turnaround time changes when we double
the system size. The solid curves use an arrival rate ratio of 2,
just as we did in Figure 35. We again use the (4,4,4,4,4,4,12,12) connection
and an arrival scaling factor of 0.1 for the (8,24,4) system. For
the (4,12,2) system, we use a (4,4,4,12) connection, which is exactly
one-half of the (4,4,4,4,4,4,12,12) connection, and an arrival scaling factor
of 0.2. For the (16,48,8) system, we use a
(4,4,4,4,4,4,4,4,4,4,4,4,12,12,12,12) connection and an arrival scaling
factor of 0.05. So, we again double the workload when we double the system
size. As we can see, the Ta curves drop quickly when we double the system
size, which is very similar to the behavior in Figure 35. But the interesting
thing is that the curve of the distributed system is now the most sensitive
to the reduction of system size. In Figure 35, however, the curve of the
partitioned system degrades quickest when we decrease the system size.

Figure 41.  The Average Turnaround Time for Different System
Size.  (Use Partial Connection)

Again, we increase the arrival scaling factor of the (4,12,2)
system in order to lower its turnaround time. Very surprisingly, when we
use 0.28, which is the same scaling factor we used in Figure 35, the turnaround
time of the (4,12,2) system drops close to that of the (8,24,4)
system. On the other hand, when we decrease the arrival scaling factor
of the (16,48,8) system to 0.0395, the turnaround time becomes roughly the
same as that of the (8,24,4) system. This implies that our $2^{1.5}$ conjecture
still holds in a partial connection system. Of course, the result we get
here is based on one particular set of connections. Although the performance
of a partial connection system is very sensitive to the connection
we use, at least we know it is possible to connect the system so that
a double-sized system can carry a workload $2^{1.5}$ times the workload of
the original system. Hence, we now gain more confidence in this simple
conjecture.


Chapter  4
CONCLUSION 


4.1  Summary 

In  the  last  chapter,  we  have  discussed  several  interesting  problems 
about  the  design  of  a  multiprocessor  system.  We  talked  about  the  performance 
of multiprogramming and monoprogramming schemes, the advantages and
disadvantages of three different memory allocation schemes, the effects of job
parameters  and  hardware  characteristics,  and  the  difference  between  using 
a  full  connection  network  and  a  partial  connection  network.  We  will  briefly 
summarize  these  results  in  this  section. 

Tables  24-27  summarize  and  compare  the  performances  of  six  system 
combinations  under  both  full  and  partial  connections.  Each  table  will  show 
the  comparison  of  one  performance  parameter.  It  is  rather  difficult  to  order 
the  performance  of  various  systems,  so  we  will  only  use  {bad,  fair,  good, 
best}  or  {high,  moderate,  low}  to  indicate  their  relative  performances. 
However,  we  do  point  out  the  system  which  yields  the  best  results  in  order 
to  give  the  reader  an  idea  which  system  might  be  the  best  choice  for  each 
area  of  performance. 

Table 24 shows the comparison of the average turnaround time. Of
course, the turnaround time of a monoprogrammed system depends heavily on the
number of processors. We will assume that there are enough processors, say 8,
in the system. Under a full connection, the distributed, multiprogrammed
system has the best turnaround time. As we explained earlier, this is caused
by high memory bandwidth and high memory utilization. The distributed,
monoprogrammed system has the next best turnaround time. Obviously, this is
because we are assuming eight processors in the system. If we assume fewer
processors, the turnaround time will degrade only a little until we reduce the
number of processors below four (cf. Figure 32). So, both distributed systems
perform very well if we use a full connection network. Both mixed systems
also have good turnaround times but are worse than the distributed
systems. This is due to the smaller memory bandwidth produced by the mixed
scheme. On the other hand, the partitioned systems both perform even worse
than the mixed systems, although not by much. This is caused by the poor
memory utilization and memory waste of the partitioned system. Overall, a
multiprogrammed system is slightly better than its monoprogrammed counterpart,
and the distributed scheme yields the best result. The performance of a full
connection system is essentially determined by the memory utilization and
the memory bandwidth.


                                      Connection
System                             Full          Partial**

Monoprogramming    Partitioned     Bad           Good
                   Mixed           Good          Good
                   Distributed     Near Best*    Good

Multiprogramming   Partitioned     Fair          Fair
                   Mixed           Good          Fair
                   Distributed     Best          Bad

*   Assuming 8 Processors
**  Assuming 2-Port Memory

Table 24.  Comparison of the Average Turnaround Time.

Under a partial connection, however, the whole situation is
reversed. All the monoprogramming results are better than the multiprogramming
results when the number of ports is reduced to two. As we explained in
the last chapter, the major reason is the queueing time for processors created
in the partial connection, multiprogrammed system. But in a partial connection,
monoprogrammed system, there is no queueing time for processors since
each job is assigned a dedicated processor. We can see that the partitioned
scheme yields the best result in a partial connection, monoprogrammed system.
The interesting thing is that it has the worst performance in a full connection,
monoprogrammed system. So, the connection has really changed the
result. The mixed, monoprogrammed system also shows very good performance,
which is similar to that of the partitioned, monoprogrammed system. In fact,
all monoprogrammed results are very close to each other. We can see this from
the results shown in Section 3.2.4. On the other hand, all the multiprogrammed
systems perform relatively poorly when we use a partial connection.

The  most  important  result  is,  if  we  properly  interconnect  the 
processors  and  memories,  the  performance  degradation  of  a  partial  connection, 
monoprogrammed  system  can  be  kept  to  within  10  to  20%  of  the  performance 
of a full connection system. For example, we have shown that the
(4,4,4,4,4,4,12,12) connection causes only a 10.6% degradation in the
turnaround time of the partitioned, monoprogrammed system. This not only
encourages  us  to  use  partial  connection  since  it  is  more  cost-effective,  but 
also  makes  the  partitioned  scheme  and  monoprogramming  more  attractive  in 
operating  system  design. 

Table 25 shows the comparison of total memory bandwidth, i.e.,
the memory bandwidth generated by all active processors. Under a full
connection, the distributed scheme can yield relatively high memory bandwidth
due to the high degree of interleaving each job can enjoy. The memory
bandwidths of both the partitioned and mixed schemes are lower than that of the
distributed scheme since now each job is confined to part of the memory and
the degree of interleaving has been reduced. As we said, this is the major
reason why the distributed system can have better turnaround time than the
other two systems. However, if we use faster memory, i.e., reduce the value
of s, the difference between the bandwidths of the distributed and partitioned
systems will be reduced. Their turnaround times hence become closer
to each other. This can be seen in Figure 36. Of course, the total memory
bandwidth of a multiprogrammed system is higher than that of its monoprogrammed
counterpart since the multiprogrammed system on the average can contain more
active jobs in the memory.

Under a partial connection, the total memory bandwidth of every
system will decrease. This is due to the decrease of memory utilization.
For a distributed system, the memory bandwidth is further decreased by
the reduction of the degree of interleaving since a job can no longer be
interleaved across the whole memory in a partial connection. Now, the total
memory bandwidths of the mixed and distributed systems are similar because
they have similar capability of containing jobs and a similar degree of
interleaving for each job.


                                      Connection
System                             Full          Partial

Monoprogramming    Partitioned     Low           Low
                   Mixed           Moderate      Moderate
                   Distributed     High          Moderate

Multiprogramming   Partitioned     Moderate*     Low
                   Mixed           High          Moderate
                   Distributed     Highest       Moderate

*   Assuming Large m (Low for Small m)

Table 25.  Comparison of the Total Memory Bandwidth.

One thing we need to explain is the total memory bandwidth of a
partial connection system. Intuitively, a multiprogrammed system should
have a higher total memory bandwidth than its monoprogrammed counterpart.
This is because a multiprogrammed system can allow more jobs in the memory
at the same time, which can cause a higher utilization of processors. However,
our simulation result shows that a monoprogrammed system has almost
the same total memory bandwidth as a multiprogrammed system if we use partial
connection. This rather surprising result actually is not difficult to
explain. As we said in Section 3.2.3, the major factor that makes a partial
connection, multiprogrammed system have a worse turnaround time than its
monoprogrammed counterpart is the queueing time for processors (see Table 20).
This queueing time is caused by the fact that in a partial connection system
every job can only be executed by the few processors which connect to all the
memory modules this job is in. In other words, a job will have to wait if
the processors that can handle it are all busy, even though some other
processors are free. This queueing phenomenon essentially reduces the number
of jobs that can be executed at the same time. Our simulation result shows
that while there are more jobs in a multiprogrammed system, on the average
both systems have approximately the same number of jobs in execution. The
extra jobs in the multiprogrammed system are either in the processor queue
or in the I/O stage. Since the numbers of jobs in execution are roughly the
same, both systems thus have similar total memory bandwidth.

However,  we  know  the  total  memory  bandwidth  represents  the  amount 
of  work  a  system  can  do  in  one  memory  cycle  time.  It  actually  indicates  how 
good  the  system  throughput  is,  if  we  consider  the  system  throughput  as  the 
number  of  jobs  that  get  done  in  a  certain  unit  of  time.  The  higher  the  total 
memory  bandwidth  is,  the  faster  the  jobs  can  be  done,  and  hence  the  higher 
the  system  throughput  will  be.  So,  the  partial  connection,  multiprogrammed 
system  should  have  the  same  throughput  as  its  monoprogrammed  counterpart 
does.  Our  simulation  result  indeed  shows  this  since  both  of  them  have  very 
similar  total  elapsed  time,  i.e.,  the  total  amount  of  time  to  finish  all 
the  jobs.  In  other  words,  the  use  of  multiprogramming  does  not  improve  the 
system  throughput  if  we  are   using  a  partial  connection. 

As we said, the partial connection, multiprogrammed system has a
worse average turnaround time due to the occurrence of the queueing time for
available processors. Yet, it has the same throughput as the partial
connection, monoprogrammed system, which means it can finish the jobs at the
same rate. If we look at the input and output of both systems, we can see both
systems have the same arrival and departure rates. The only factor that can
cause the difference in the average turnaround times is apparently the order
in which these jobs get done. Although we are using the same scheduling
algorithm in both systems to select jobs for execution, the queueing time for
available processors in a partial connection, multiprogrammed system might
delay the execution of some jobs and allow some other jobs to be processed
faster. For example, a job that comes into the memory first and requires less
CPU time than any other job in the memory might not be finished first if
it has to compete for a processor with some other jobs all the time. In a
monoprogrammed system, this will not happen since the time at which a job
finishes once it enters the memory depends solely on its CPU and I/O time
requirements. In other words, the effect of the scheduling algorithm will be
reduced by the queueing delay in a multiprogrammed system. This is why a
partial connection, monoprogrammed system has a better average turnaround time.
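
A toy example, unrelated to the simulator and using made-up CPU times, illustrates how two systems can finish the same jobs in the same total elapsed time and still differ in average turnaround time purely because of the completion order. It assumes a single processor, no I/O, and that all jobs arrive at time zero; the function name is hypothetical.

```python
def schedule_stats(cpu_times):
    """Run jobs back to back in the given order on one processor.

    All jobs are assumed to arrive at time 0, so a job's turnaround time
    equals its completion time. Returns (total elapsed time, average
    turnaround time).
    """
    clock = 0
    turnarounds = []
    for cpu_time in cpu_times:
        clock += cpu_time
        turnarounds.append(clock)
    return clock, sum(turnarounds) / len(turnarounds)


if __name__ == "__main__":
    short_first = schedule_stats([1, 3, 8])  # shortest job completes first
    long_first = schedule_stats([8, 3, 1])   # same jobs, reversed order
    print(short_first)
    print(long_first)
```

Both orders finish at time 12, so the throughput is identical, but the average turnaround time is about 5.7 in the first case and about 10.3 in the second.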

Therefore, the average turnaround time cannot always tell us how fast
the system is doing work. Only the total memory bandwidth can indicate how
good the system throughput is.

Let  us  now  summarize  the  memory  utilization  of  these  systems  in 
Table  26.  As  we  can  see,  the  multiprogrammed  systems  have  better  memory 
utilization  than  the  monoprogrammed  systems,  and  the  full  connection  systems  have 
better  memory  utilization  than  the  partial  connection  systems.  This  is  what  we 
would  expect.  From  all  the  data  we  collected,  the  mixed  and  distributed 
systems  both  show  rather  similar  memory  utilizations.  This  is  because  both 
systems  have  the  same  capability  of  containing  jobs  in  the  memory,  as  we  have 
mentioned  several  times.  The  partitioned  system,  however,  has  a  significantly 
lower  memory  utilization,  which  is  caused  by  the  memory  waste  created  during 
the  whole-module  allocation  process.  This  is  the  major  reason  why  the 
partitioned  system  yields  the  worst  turnaround  time  when  we  are  using  a  full 
connection  network. 

Of course, a partial connection system has lower memory utilization
than a full connection system since only some of the processors are connected
to a memory module. So, if all the processors connected to a certain module
are busy, then the unused portion of this module will be wasted. Or, if the
unoccupied memory is split between several processors and none of these
processors has by itself large enough space for the next job, then the
unoccupied memory again will be wasted. This is what causes the performance
to degrade. Fortunately, if we can use a good connection by considering
the job size distribution, it is possible to keep the memory waste, and
hence performance degradation, very low.

Table 26.  Comparison of the Memory Utilization

The other performance parameter we often mentioned in the last
chapter is the job memory bandwidth, which is the bandwidth each processor
gets to execute a job. It is obtained by dividing the total memory bandwidth
by the number of jobs in memory. As it turns out, the job memory bandwidths
of the mixed and partitioned systems are only affected by the processor-memory
speed ratio s and the number of memory modules m. They remain
essentially unchanged when we change the other system parameters. This is
not surprising, since under these two schemes most or all of a job is isolated
and protected from the influence of the other jobs. Once a job gets
into the memory, the speed ratio will affect the memory bandwidth this job
can get since s determines the number of requests the processor can generate
per memory cycle. On the other hand, m will determine the degree of
interleaving for a job, which also affects the bandwidth. However, the job
memory bandwidth of the distributed system will be affected by almost every
system parameter. Of course, it will be affected by the speed ratio s and the
number of memory modules m. In addition, it will also be affected by
parameters like job arrival rate, the average amount of time per I/O operation,
the average number of I/O requests per job, monoprogramming or
multiprogramming, and so on. All these parameters will affect the number of
jobs in execution at the same time, which in turn will affect the memory
bandwidth due to mutual interference. Furthermore, the connection network also
has a very significant effect on the job memory bandwidth. As we said before,
a partial connection might drastically reduce the degree of interleaving of
a job and will seriously decrease the job memory bandwidth.

Overall, a full connection, distributed, monoprogrammed system will
yield  the  highest  job  memory  bandwidth.  This  can  be  seen  in  Table  7  where 
we  show  some  numerical  values  of  the  job  memory  bandwidth. 

In Table 27, we list the system which produces the best result in
each performance area, under either a full or a partial connection. If we
use full connection, the distributed, multiprogrammed system shows the
best turnaround time, the largest total memory bandwidth, and the highest
memory utilization. The best job memory bandwidth, however, belongs to the
distributed, monoprogrammed system. On the other hand, if we use partial
connection, the partitioned, monoprogrammed system now shows the best
turnaround time. Both the distributed, multiprogrammed system and the
distributed, monoprogrammed system display the best total memory bandwidth.
The best memory utilization and the best job memory bandwidth are still
obtained by using the distributed (or mixed), multiprogrammed system and the
distributed, monoprogrammed system, respectively.

The full connection, distributed, multiprogrammed system seems to
be a better choice, since it gives the minimum turnaround time. However, the
partial connection, partitioned, monoprogrammed system is more cost-effective.
Especially when the system size is large, the use of a partial connection
network can reduce the system cost significantly. Moreover, a partially
connected system is easier to maintain and expand. When we add or
delete a memory module or a processor, only very few connections have to
be altered, and the rest of the system can be kept untouched and go on
operating. So, a partial connection system also has the advantage of high
availability and expandability.

All the performance measures, in particular the turnaround time,
are very sensitive to the job mix we are using. We have shown the turnaround
time and the memory utilization will increase rather rapidly when we increase
the arrival rate, the job sizes, or the I/O time. The reason is that
these parameters can easily push the system into saturation. Therefore,
when we are designing a system, we should carefully study the job mix we
are dealing with.

One of our most interesting results is the $2^{1.5}$ workload relationship
between two systems that have a size ratio of 2. Our simulation shows
that, when we double the system size, we can handle 2.7 to 2.8, or roughly
$2^{1.5}$, times the original workload. This is true for both the full and
partial connection systems. So, our conjecture is that the system size C (or
the cost) and the workload it can handle P (or the processing power) maintain
a $P = aC^{1.5}$ relationship. Of course, this conjecture has been shown to
hold only for systems of size up to 16 processors and 48 memory modules. As we
said in Section 3.2.3, this factor will be reduced to about 2.3 when we double
the system size again to (32, 96, 16). So, we believe the improvement factor
of 2.8 would approach 2 as the system gets very large.
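
The exponent implied by each observed improvement factor can be checked with a few lines of arithmetic. The sketch below simply solves factor = 2^k for k, using the factors quoted above; the function name is illustrative and nothing here comes from the simulator itself.

```python
import math


def implied_exponent(workload_ratio, size_ratio=2.0):
    """Exponent k such that workload_ratio == size_ratio ** k."""
    return math.log(workload_ratio) / math.log(size_ratio)


if __name__ == "__main__":
    print(round(2 ** 1.5, 3))  # factor predicted by P = a * C**1.5
    for factor in (2.7, 2.8, 2.3):
        print(factor, round(implied_exponent(factor), 2))
```

The observed factors of 2.7 to 2.8 correspond to exponents of about 1.43 to 1.49, close to 1.5, while the factor of 2.3 reported for the (32, 96, 16) system corresponds to an exponent of about 1.20.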

4.2  Some  Design  Problems 
4.2.1  Address  Interleaving 

As we said in the last section, a partial connection, partitioned,
monoprogrammed system is the most cost-effective choice of system design.
However, we pointed out that we will need a new scheme of generating physical
addresses if we want to use interleaving to get the best possible
memory bandwidth. Of course, when the memory-processor speed ratio s is
small, say 1 or 2, there will not be much difference whether we use
interleaving or not. For s=2, even if we store a program vertically inside a
memory module, quite often we might still be able to access more than one
word if the data and the instruction we can fetch simultaneously are in
different modules. If we use interleaving, i.e., store a program horizontally
across several memory modules, we might only get a little better
chance of accessing two words without conflict. So, it might not be worth
it to implement the interleaving scheme when s is small. When s is larger,
however, the interleaving scheme will show a much better bandwidth since
several instructions and data can be accessed at the same time. It is more
desirable to use interleaving under this circumstance.

If we use interleaving in a partitioned system, two problems will
arise that make the generation of physical addresses very difficult. First,
the number of memory modules allocated to a job is variable, depending on
the size of this job. This implies that the degree of interleaving will
be different for each job. Consequently, a processor must be able to adjust
its address mapping mechanism to cope with the changing degree of
interleaving. Second, the modules a job gets in general will be scattered all
over the memory and might not be adjacent to each other. Therefore, it
will cause some trouble to locate an instruction or operand. If we
horizontally interleave a program, the next instruction might be several
modules away from the current instruction. So, we will not be able to get the
address of the next instruction by simply adding one to the current module
number. Apparently, we need more hardware and a new algorithm in the
instruction decoding unit of a processor in order to generate a physical
address properly. Let us propose a simple and feasible design which can solve
this problem.

Figure  42  shows  the  logic  diagram  of  the  design  we  are  proposing. 
The logical address register contains the logical address we want to
transform, and the final physical address will be in the physical address
register.
The  hardware  between  them  is  used  to  do  the  transformation.  The  physical 
address  consists  of  two  parts,  namely,  a  module  number  x  and  an  in-module 
word  address  w,  which  will  be  obtained  by  the  following  process. 

First, let us point out one thing which will affect the way we
interleave a program, and hence affect the memory bandwidth. Assume that
the program counter is L bits long. If we interleave a program successively
into all the memory modules, say c of them, we must be able to perform a
"quotient-remainder of c" operation on these L bits, called QR_c(L), in order
to find the module and word corresponding to this address. However, c is
a variable which is determined by the size of the job currently running on
this processor. This means every processor should be provided the hardware
to perform the QR operations for all possible c values if we want to
interleave a program in a normal way. This is not economically attractive since
it implies that we must build several QR circuits inside each processor for
address decoding. Therefore, we must seek some other method to interleave
a program.

Figure 42. Address Mapping for Partitioned System.

The method we suggest is the following: If a job requires a power
of two modules or some number of modules for which QR hardware exists, then
we will interleave the program in the normal way. Otherwise, we will
partition the modules into a power of two groups each having the same
number of modules such that QR hardware exists for this group size. For
example, assume the processor has the hardware to do a QR_3 operation and a
job requires six modules. We will partition the modules into two groups with
three modules in each group. However, if the number of modules a job needs
is not a multiple of 3, we will have to grant the job some extra memory to
make the number a multiple. Now, we will use the last g bits of the logical
address to determine the proper group. These g bits are called "Group Bits."
If there are only two groups, g will be 1. In so doing, we actually achieve
a double interleaving, i.e., we not only interleave the successive addresses
into different modules, but also interleave them into different groups.
Figure 43 shows the result of interleaving a 6-module job, if we use the
last bit to indicate the group. We will call this a 3-3 interleaving. The
important thing is that this method still allows us to fully interleave a
program across all the modules even though we do not have the appropriate
QR_6 circuit.
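The double interleaving described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `interleave` is hypothetical, the high-order bits that remain after the group bits are assumed to go through the QR operation for the group size c, and the rule used to combine group and remainder into one logical module number (`group * c + remainder`) is an assumption, since Figure 43 is not reproduced here.

```python
def interleave(addr, g, c):
    """Map a logical address to (logical module, in-module word).

    g : number of group bits (the job uses 2**g groups)
    c : modules per group, i.e. the base of the available QR circuit
    """
    group = addr & ((1 << g) - 1)          # the last g bits select the group
    word, remainder = divmod(addr >> g, c)  # QR_c on the remaining bits
    # Assumption: groups occupy consecutive blocks of logical module numbers.
    module = group * c + remainder
    return module, word


# 3-3 interleaving of a 6-module job: g = 1, c = 3
layout = [interleave(addr, 1, 3) for addr in range(12)]
```

With g = 1 and c = 3, any six consecutive addresses fall into six different logical modules, which is the full interleaving the text asks for even though no QR_6 circuit is involved.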

The first (left) L-g bits of the program counter will be fed into
a shift register where we will perform the QR operation. Since g is a
variable which depends on the number of groups we form for the job, we need
a shift register here in order to shift the logical address g bits to the
right. Of course, if g is zero the whole content of the logical address
register will be gated into the shift register without shifting. So, the
shift register must also be L bits long. We now perform the QR operation on
the content of the shift register. The remainder of this operation will
tell us the correct module within a group. This remainder together with
the group bits will tell us the logical module number we are looking for.
The quotient will give us the correct address inside the module.

Figure 43. The Interleaving of a 6-Module Job.

Now, let us describe how we do the QR_3 operation. Of course, we
can use a combinational circuit to perform the operation. For example,
Gajski and Vora [49] have a very nice design of a modulo 3 circuit. However,
it might take a large number of gates to implement the circuit when L is
large, and we also need to determine the quotient. So, we choose another
design using read-only memories (ROM). If L is small, say 10 to 12, we can
use one ROM to find the remainder and quotient of division by 3. However, we are
using 1024K bytes of memory in our system, which means L is about 18 to 20.
We then have to use a very large ROM with a number of bits in the order of 2^20!
This is apparently very expensive. So, we will use the design shown in Figure 42,
which requires seven ROM's of reasonable size, one small integer adder and
an incrementer. Of course, we have to spend a little longer time to do the
operation.

In fact, our design is good for any base. Let us use c to represent
the base of the QR operation. We first break the shift register into
two equal halves each having n = L/2 bits. The contents of the right and
left halves will be called a and b respectively. In other words, the
content of the shift register can be expressed as $a + b2^n$. Also, let us further
decompose b into uc+v, i.e., let b = uc+v. The remainder and quotient of the
QR operation can be found to be:

$$\text{remainder} = (a + b2^n) \bmod c = [a \bmod c + (b2^n) \bmod c] \bmod c = [r_1 + r_2] \bmod c,$$

and

$$\text{quotient} = (a + b2^n) \div c = \lfloor a/c \rfloor + \lfloor b2^n/c \rfloor + \lfloor (r_1+r_2)/c \rfloor = \lfloor a/c \rfloor + u2^n + \lfloor v2^n/c \rfloor + \lfloor (r_1+r_2)/c \rfloor,$$

where $r_1 = a \bmod c$, $r_2 = (b2^n) \bmod c$, $\div$ represents integer division, and
$\lfloor\ \rfloor$ is the floor function. ROM 2 contains the result of $r_1 = a \bmod c$ and ROM 5
contains the result of $r_2 = (b2^n) \bmod c$. The outputs of these two ROM's
will be used as the address to ROM 7, which contains the result of $(r_1+r_2) \bmod c$.
The output of ROM 7, i.e., the remainder, coupled with the group
bits will tell us the logical module number. Each word of ROM 1 contains
a result of $\lfloor a/c \rfloor$ together with a bit that tells whether a is all one's or not.
Hence, all the words in ROM 1 have zero in the last bit position except the
last word. The use of this bit will be explained very soon. ROM 3 contains
the result of u which will be shifted to the left by n bits. The third term
on the right hand side of the quotient equation, i.e., $\lfloor v2^n/c \rfloor$, will be called
a "Corrector" term. Since v has only c possible values, the number of
possible values for the corrector term is at most c. If c is small, the
corrector term can only have a few possible values. Therefore, we can either
hardwire them or use a small number of registers to store them. ROM 4 will
be used to select the corrector value we should use. The outputs of ROM 2
and ROM 5, i.e., $r_1$ and $r_2$, will also be fed into ROM 6, which gives the
result of $\lfloor (r_1+r_2)/c \rfloor$. It is easy to see that $0 \le r_1+r_2 < 2c$, so $\lfloor (r_1+r_2)/c \rfloor$ is
either 0 or 1. Hence, the output of ROM 6 is only 1 bit long which can be
used as the carry-in to the adder.

The adder is only n bits long and adds $\lfloor a/c \rfloor$, $\lfloor v2^n/c \rfloor$, and $\lfloor (r_1+r_2)/c \rfloor$
together. These three terms are n-1, n, and 1 bits long respectively.
The fourth term, i.e., $u2^n$, will be fed into an incrementer, which will
increment by one if the adder generates a carry. The reason we use an n-bit
adder and an n-bit incrementer instead of a 2n-bit adder is that the
summing of these four terms is rather special. Figure 44 shows the length of
each term and how they align when they are summed together. As we can see,
the only chance u will be affected is when the other three terms produce a
carry. So, we only need an incrementer for the left n bits. In fact, the
carry can be generated in advance by using the last bit of ROM 1,
$\lfloor (r_1+r_2)/c \rfloor$, and the output of ROM 4, so we do not have to wait for the propagation
delay of the adder. This can be shown by the following example.

Let us assume c=3 and n=10. $\lfloor a/c \rfloor$ will be 9 bits long. Since v can
only be 0, 1, or 2, the output of ROM 4 is only 2 bits long. The corrector
term can be shown to be one of the following three values: namely,
0000000000, 0101010101, or 1010101010 (all in base 2). The last word of
ROM 1 contains 101010101 which is the largest value of $\lfloor a/c \rfloor$ and is the only
word that will generate a carry provided the corrector term is 1010101010
and $\lfloor (r_1+r_2)/c \rfloor = 1$. The last bit of ROM 1 will tell whether $\lfloor a/c \rfloor$ is 101010101
or not and the output of ROM 4 can tell whether the corrector term is
1010101010 or not. So, we can AND them together to see if we should
increment u or not. The results of the adder and incrementer will be combined
to form the quotient we are looking for.
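The decomposition can be checked with a small table-driven model. The sketch below is not the circuit of Figure 42 itself: the names `build_roms` and `qr` are hypothetical, each ROM is modelled as a Python list, ROM 5 is assumed to be addressed by b, and ROM 6 and ROM 7 are assumed to be addressed by the pair (r1, r2). The corrector values are held in a short list selected by the ROM 4 output.

```python
def build_roms(c, n):
    """Build lookup tables mirroring ROM 1 to ROM 7 for base c and n-bit halves."""
    size = 1 << n
    return {
        "rom1": [(a // c, int(a == size - 1)) for a in range(size)],  # floor(a/c), all-ones flag
        "rom2": [a % c for a in range(size)],                        # r1 = a mod c
        "rom3": [b // c for b in range(size)],                       # u
        "rom4": [b % c for b in range(size)],                        # v, selects the corrector
        "rom5": [(b << n) % c for b in range(size)],                 # r2 = (b * 2^n) mod c
        "rom6": [[(r1 + r2) // c for r2 in range(c)] for r1 in range(c)],
        "rom7": [[(r1 + r2) % c for r2 in range(c)] for r1 in range(c)],
        "corrector": [(v << n) // c for v in range(c)],              # floor(v * 2^n / c)
    }


def qr(x, roms, n):
    """Return (quotient, remainder, carry) of x divided by c using only table lookups."""
    mask = (1 << n) - 1
    a, b = x & mask, x >> n                      # right and left halves
    r1, r2 = roms["rom2"][a], roms["rom5"][b]
    low = (roms["rom1"][a][0]
           + roms["corrector"][roms["rom4"][b]]
           + roms["rom6"][r1][r2])               # n-bit adder with carry-in
    carry = low >> n                             # would drive the incrementer
    quotient = ((roms["rom3"][b] + carry) << n) | (low & mask)
    return quotient, roms["rom7"][r1][r2], carry


roms = build_roms(3, 10)
q, r, carry = qr(1000000, roms, 10)   # agrees with divmod(1000000, 3)
```

An exhaustive run over all 20-bit inputs agrees with ordinary integer division. It also shows that for c = 3 and n = 10 the all-ones value of a has r1 = 0, so the three carry conditions named in the example do not coincide and this model never observes a carry out of the n-bit sum.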

The size of ROM 1 is $2^n \times n$ bits, the sizes of ROM 2, ROM 4,
ROM 5, and ROM 7 are all $2^{\lceil \log_2 c \rceil} \times \lceil \log_2 c \rceil$ bits, the size of ROM 3 is

$2^n \times (n-1)$ bits, and ROM 6 is only $2^n \times 1$ bit. In fact, ROM 6 and ROM 7 can
be put together in one ROM. If L is 20, n is 10, and c is 3, we need 20
1K x 1 bit ROM's and one 16 x 4 bit ROM. So, the hardware is actually very cheap.

Figure 44. The Length of Each Component in Quotient Computation.

We  can  see  from  Figure  42  that  it  takes  at  most  two  ROM  cycles 
and  one  addition  cycle  to  do  a  QR  operation.  However,  this  is  in  general 
less  than  most  of  the  arithmetic  operation  times.  So,  our  design  will  not 
cause  any  serious  problem  to  the  address  decoding. 

As  we  mentioned  earlier,  the  memory  modules  a  job  gets  might  be 
scattered  all  over  the  memory,  and  they  might  not  be  adjacent  to  each  other. 
So,  we  need  to  transform  the  logical  module  number  we  obtained  above  to  the 
physical  module  number.  This  is  what  the  Module  Mapping  Table  shown  in 
Figure  42  will  do.  The  Module  Mapping  Table  is  in  fact  a  cache  memory.  When 
we  allocate  memory  to  a  job,  we  will  record  the  physical  module  numbers  in 
the  table  in  the  correct  order,  which  can  be  retrieved  later.  Hence,  the 
Module  Mapping  Table  acts  just  like  the  page  table  used  in  a  paging  system. 
After  the  physical  module  number  has  been  retrieved,  it  will  be  appended  to 
the  in-module  word  address  to  form  the  final  physical  address. 
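The table lookup can be pictured as follows. This is a minimal sketch with hypothetical names (`physical_address`, `mapping_table`, `word_bits`). It assumes that every module holds 2**word_bits words and that the physical module number forms the high-order part of the physical address. The text only says the two parts are appended.

```python
def physical_address(logical_module, word, mapping_table, word_bits):
    """Translate (logical module, in-module word) into one physical address."""
    # The table was filled at allocation time, in logical-module order.
    physical_module = mapping_table[logical_module]
    # Assumption: module number in the high bits, word address in the low bits.
    return (physical_module << word_bits) | word


# A job that was granted physical modules 5, 2 and 9, in that logical order
table = [5, 2, 9]
addr = physical_address(1, 3, table, word_bits=4)   # module 2, word 3
```

As with a page table, the job never sees the physical module numbers. Only the table entries change when the same job is loaded into a different set of modules.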

Of course, a job that requires some power of two modules does not
need the QR operation, since the module number and the in-module word
address can be obtained simply by breaking the logical address into two
parts. The lower ℓ bits (cf. Figure 42) will indicate the module number,
which will be used by the Module Mapping Table to retrieve the physical
module number. The upper L-ℓ bits will be gated directly to the physical
address register. The two multiplexors (MUX's) in Figure 42 are used to
decide which result should be used.

The only drawback of this scheme is that sometimes we have to waste
some memory in order to make it work. For example, if we can only
do the QR_3 operation and a job requires five modules, we must allocate six modules
to this job and use the interleaving scheme shown in Figure 43. Table 28
shows the actual number of modules each job will be allocated and the
interleaving scheme to be used. As we can see, 5-module and 7-module jobs
will be granted six modules and eight modules respectively. Obviously, this
design will further increase the memory waste originally present in a
partitioned system. The situation is even worse for 9-module, 10-module,
and 11-module jobs since all of them must be allocated 12 modules to use
3-3-3-3 interleaving, i.e., the modules will be partitioned into four groups
with three modules in each group. We cannot use 3-3-3 interleaving for
9-module jobs since we need a power of two groups in order to use the last
few bits as the group bits. (Recall, however, Table 21 which indicates
that for our job mix, very few jobs require more than six modules.)

If we also implement a QR_5 circuit in the processor, we can
improve the situation stated above. In particular, the 5-module jobs will no
longer need an extra module in order to use the 3-3 interleaving. In
Table 29, we show the result with the addition of a QR_5 circuit.
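The allocation rule behind Tables 28 and 29 can be expressed as a short search. The sketch below uses a hypothetical helper, `allocate`, and assumes the rule is "grant the smallest module count of the form c * 2**g that covers the job, where c is either 1 or a base for which QR hardware exists". It returns the granted module count, the interleaving scheme, the number of group bits g, and the number of module bits ℓ used on the power-of-two path.

```python
def allocate(job_size, qr_bases=(3,)):
    """Return (modules granted, interleaving scheme, g, l) for a job."""
    best = None
    for c in (1,) + tuple(qr_bases):        # c = 1 means a pure power of two
        g = 0
        while (c << g) < job_size:          # smallest power-of-two multiple of c
            g += 1
        total = c << g
        if best is None or total < best[0]:
            best = (total, c, g)
    total, c, g = best
    if c == 1:
        # Power-of-two job: no QR needed, the low l bits give the module.
        return total, str(total), 0, g
    # Otherwise 2**g groups of c modules each, selected by g group bits.
    return total, "-".join([str(c)] * (1 << g)), g, 0


with_qr3 = [allocate(k) for k in range(1, 13)]
with_qr3_and_qr5 = [allocate(k, (3, 5)) for k in range(1, 13)]
```

For job sizes 1 to 12 this search grants the module counts listed in the two tables, for example 12 modules (3-3-3-3) for a 9-module job with only QR_3, and 10 modules (5-5) once QR_5 is added.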

After the generation of a physical address, the processor will
send it to all the memory modules attached to its processor bus. Inside
each module, there should be some identifying hardware so that the
destination module will pick up the address and other modules will reject the
address. This can be easily done by using a comparator and an identity register
which contains the module number.

Job     Number of Modules   Interleaving    g    ℓ    Memory
Size    Required                                      Waste
  1            1            1               0    0    No
  2            2            2               0    1    No
  3            3            3               0    0    No
  4            4            4               0    2    No
  5            6            3-3             1    0    Yes
  6            6            3-3             1    0    No
  7            8            8               0    3    Yes
  8            8            8               0    3    No
  9           12            3-3-3-3         2    0    Yes
 10           12            3-3-3-3         2    0    Yes
 11           12            3-3-3-3         2    0    Yes
 12           12            3-3-3-3         2    0    No

Table 28. The Number of Modules Required and Interleaving
Scheme for each Job Size. (with only a QR_3 Circuit)

Job     Number of Modules   Interleaving    g    ℓ    Memory
Size    Required                                      Waste
  1            1            1               0    0    No
  2            2            2               0    1    No
  3            3            3               0    0    No
  4            4            4               0    2    No
  5            5            5               0    0    No
  6            6            3-3             1    0    No
  7            8            8               0    3    Yes
  8            8            8               0    3    No
  9           10            5-5             1    0    Yes
 10           10            5-5             1    0    No
 11           12            3-3-3-3         2    0    Yes
 12           12            3-3-3-3         2    0    No

Table 29. The Number of Modules Required and Interleaving
Scheme for each Job Size. (with both QR_3 and QR_5 Circuits)

4.2.2 I/O Connection

The other problem we need to discuss is the connection between
processors and I/O devices. Recall from Figure 1 that the PRIME system uses an
External Access Network (EAN) to provide the communication paths
between processors and external devices. The network is essentially a
crossbar switch, except that it also allows two processors to set up a path between
themselves. This is done by connecting both processors to a free switch
node, i.e., a node that does not connect to any external device [15].

This  network  is  easier  to  control  and  allows  simultaneous  use  of 
all  the  I/O  devices.  The  probability  of  access  conflict,  i.e.,  more  than 
one  processor  accessing  the  same  device,  will  be  small  if  the  number  of 
I/O  devices  is  large  enough.  For  example,  in  the  systems  we  are  simulating, 
the  results  show  that  the  average  number  of  jobs  in  the  I/O  stage  is  less 
than  the  number  of  I/O  devices.  This  implies  that  in  general  a  job  does 
not  need  to  wait  for  an  I/O  device  if  we  use  a  network  like  the  EAN  to 
interconnect  the  processors  and  I/O  devices. 

However, this kind of network will be very expensive
if the system size is large. This is the typical disadvantage of a
crossbar-like network. In our simulation model, we choose a common bus structure
which is shown in Figure 45. Each processor in this structure is connected
to a common bus. The number of common buses we should use is determined
by the traffic between processors and I/O devices. Usually, this is
proportional to the system size. The I/O devices are partitioned into groups and
the I/O devices in one group are connected to an I/O bus. These I/O buses
are interconnected with the common buses via a small crossbar switch. The
switch allows a processor to access any I/O device.

Figure 45. The Interconnection Network between Processors and
I/O Devices.

This design is apparently much cheaper than
a connection network like the EAN in the PRIME system. If we use n
common buses and ℓ I/O buses, the total cost will be the cost of n+ℓ buses plus
the cost of an n by ℓ switch. If n and ℓ are small, this cost will be
very low. In addition, when we increase the number of processors, we can
simply connect an additional processor to a common bus if the I/O traffic
does not increase too much. In an EAN-like network, the extra processor will
cause a significant increase of the network size.

Although the sharing of a common bus by a number of processors can
reduce the number of buses we need and the size of the switch, bus
contention might occur if more than one processor connected to the same bus wants
to access I/O devices at the same time. The contention can be serious if
the average number of I/O requests for a job is large or too many processors
are connected to one bus. On the other hand, bus contention might also occur
on an I/O bus if more than one processor is trying to access the I/O devices
connected to the same I/O bus. Both of these bus contentions will result
in queueing of an I/O request, which will cause some delay to a job. Of
course, we can keep the bus contentions small by making n and ℓ large enough.
In order to find out the number of buses we should use, let us analyze how
n and ℓ affect the bus contentions occurring in our connection network.

We will derive the probability that a processor can successfully
access the I/O device it wants without being blocked due to bus contention.
Let us assume α to be the probability that a certain processor is performing
an I/O operation. Roughly speaking, the ratio of the I/O time to the total
service time can be thought of as α. Therefore, a processor-bound job mix
has a small value of α and an I/O-bound job mix has a large value. We will
assume α to be the same for all active jobs, or processors. Of course,
1-α is the probability that a processor will not issue an I/O request. We
also assume each common bus has p/n processors attached to it.

Obviously, if a processor wants to make a successful access,
the required common and I/O buses should both be free. So, the probability
of a successful access is the product of the probabilities of these two
events. The probability of the first event is $(1-\alpha)^{p/n-1}$, which is the
probability that all other p/n - 1 processors sharing the same common bus
are not doing I/O. The probability of the second event is

$$\sum_{i=0}^{n-1} \binom{n-1}{i} \left[1-(1-\alpha)^{p/n}\right]^{i} \left[(1-\alpha)^{p/n}\right]^{n-1-i} \left(1-\frac{1}{\ell}\right)^{i},$$

where each term in the summation is the probability that exactly i of the
remaining n-1 common buses are busy but none of these i requests are accessing
the I/O bus we are interested in. This summation is actually the expansion
of the following expression:

$$\left[(1-\alpha)^{p/n} + \left(1-\frac{1}{\ell}\right)\left(1-(1-\alpha)^{p/n}\right)\right]^{n-1} = \left[1 - \frac{1}{\ell} + \frac{1}{\ell}(1-\alpha)^{p/n}\right]^{n-1}.$$

Hence, the probability of successful access Ps can be expressed as follows:

$$P_s = (1-\alpha)^{p/n-1} \left[1 - \frac{1}{\ell} + \frac{1}{\ell}(1-\alpha)^{p/n}\right]^{n-1}.$$
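The closed form is easy to evaluate numerically. The sketch below uses hypothetical function names (`p_success`, `p_success_sum`) and treats p/n as a real number, which is an assumption: when p/n is not an integer (for example p = 8, n = 3) some convention for distributing processors over buses is needed, and the sketch does not try to settle it. The second function evaluates the binomial sum directly, as a check on the algebra.

```python
from math import comb


def p_success(alpha, p, n, l):
    """Closed-form probability of a successful I/O access.

    alpha : probability that a processor is doing I/O
    p     : number of processors
    n     : number of common buses
    l     : number of I/O buses
    """
    share = p / n                                   # processors per common bus
    bus_free = (1 - alpha) ** (share - 1)           # no one else on my common bus
    idle = (1 - alpha) ** share                     # another common bus is idle
    io_free = (1 - 1 / l + idle / l) ** (n - 1)     # my I/O bus is not taken
    return bus_free * io_free


def p_success_sum(alpha, p, n, l):
    """Same quantity, with the second factor evaluated as the explicit sum."""
    share = p / n
    idle = (1 - alpha) ** share
    total = sum(comb(n - 1, i) * (1 - idle) ** i * idle ** (n - 1 - i)
                * (1 - 1 / l) ** i for i in range(n))
    return (1 - alpha) ** (share - 1) * total


value = p_success(0.3, p=8, n=8, l=4)   # about 0.58
```

For eight processors with n = 2 and two I/O buses, the function gives about 0.07 at α = 0.5 and about 0.60 at α = 0.1.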

In Table 30, we show some numerical values of Ps for different α,
n, and ℓ values. Here we use eight processors. As we can see, when α is
small, Ps is rather large even for moderate values of n. If we use the
definition stated above, we can see the α value for our job mix is about
0.3 since our simulation result shows that the I/O time for a job is roughly
30% of the total service time. If we look in the 0.3 column, we need four
common buses and four I/O buses in order to obtain a near 0.5 probability.
Even with eight common buses, we can only achieve near 0.6 probability. It
seems that this design is not very attractive due to the high blocking
probability.

p = 8

(ℓ = 2)
  n \ α     0.5     0.4     0.3     0.2     0.1
    2       0.07    0.12    0.21    0.36    0.60
    3       0.08    0.13    0.22    0.37    0.61
    4       0.12    0.19    0.29    0.44    0.67
    8       0.13    0.21    0.32    0.48    0.70

(ℓ = 3)
  n \ α     0.5     0.4     0.3     0.2     0.1
    2       0.09    0.15    0.26    0.41    0.65
    3       0.13    0.20    0.30    0.45    0.67
    4       0.21    0.29    0.40    0.55    0.74
    8       0.28    0.37    0.48    0.62    0.79

(ℓ = 4)
  n \ α     0.5     0.4     0.3     0.2     0.1
    2       0.10    0.17    0.28    0.44    0.67
    3       0.15    0.23    0.34    0.50    0.70
    4       0.27    0.36    0.47    0.60    0.78
    8       0.39    0.48    0.58    0.70    0.84

Table 30. The Probability of Successful Access to
I/O Device.

However, when we are doing an I/O operation, we do not have to
occupy the buses all the time. We can release the buses during the seek
and rotational latency time of an I/O operation and let some other processors
use the buses. In other words, a processor will only occupy the common and
I/O buses when the data or address is being transferred. This effectively
reduces the value of α and hence increases the probability of successful
access. For example, if the data transfer time is only one-third of an
I/O transaction time, α can be reduced from 0.3 to 0.1 and we can get a very
high probability of a successful I/O access. Of course, we need to implement
a smart controller to control the use of these buses.

In fact, if we assume a bus will only be occupied during data
transfer, our model of Figure 45 is almost equivalent to the L-M memory
organization model proposed by Briggs [50]. The analytic result of his
model can be slightly modified to fit our model or used as an approximation
of our model.

In our simulations, we used p common buses and two I/O buses, i.e.,
n=p and ℓ=2. From all simulation results, we can see the queueing time
for I/O devices is relatively low. Apparently, the structure of Figure 45
is quite good. If we use faster I/O devices, we can also reduce the α value.
Hence, we can use fewer buses and maintain the same performance.

4.3 Further Problems

Multiprocessors have been an important subject in computer design
for quite a while. Recently, new technology has permitted us to consider
systems with large numbers of autonomous processors. Most of the work done
previously in this area has concentrated on speeding up a single program through
the use of multiple processors, or on providing a multiprogramming
environment. This in turn has led to complex memory-processor and interprocessor
communication schemes. We have shown in this thesis that multiprogramming
is not necessary for high throughput and low turnaround time, and that a simpler
architecture is indeed a viable design alternative, producing good performance
and expandability, and capable of high reliability and availability.

However, there are still a number of areas which need further
study. We discussed the design of several components of a system (e.g.,
addressing hardware), but many of the details of the processor, memory, and
I/O systems need more work. Some of this design is straightforward, but
some requires better models before we fully understand the tradeoffs involved.
For example, in determining the actual distribution and connection of
memories to processors, we have been unable to demonstrate a model which
correctly predicts the best distribution or connection. But we have shown
that the distribution and connection do cause significant changes in
performance.

In this thesis, we have purposely omitted consideration of interactive
job loads and virtual memory. An interactive environment imposes new problems both
in connecting terminals to the system and in handling the large number of small
tasks. This type of environment should be investigated to determine whether
or not it would necessitate significant changes to the architecture or our
conclusions.

Finally, a very important research area concerns reliable operating
systems. Conceptually, it is easier to design an operating system with
centralized control. But this approach leaves us open to total system
failure if the control hardware fails. In Chapter 1, we briefly described
the design philosophies used by the PRIME, C.mmp, and NonStop systems. The
C.mmp and NonStop systems essentially let each subsystem own a copy of the
operating system. This prevents the failure of the operating system of one
subsystem from affecting the operations of the other subsystems. However,
this duplication does occupy a significant amount of memory. The PRIME
system, on the other hand, partitions the operating system into a central
control monitor and external control monitors (ECM's) and distributes these ECM's
to different subsystems. This distributed approach, of course, can reduce
the memory requirement. However, it does create some problems when the
control subsystem (the one that is running the central control monitor)
goes down, for example, how to save all the system tables and pass control
to the subsystem that is taking over the central control monitor. Complete
distribution of the central control, or minimization of the central control
so it could reside in a highly reliable component, is an interesting and
important research area.

Appendix A

Assume we have p processors referencing m memory modules numbered
from 0 to m-1. Each processor generates s references in every memory cycle.
Let α be the probability that the next reference will access the next module
in sequence, i.e., $\alpha = \Pr\{r_{\ell+1} = (r_\ell + 1) \bmod m\}$, and let any other module
be accessed with the probability $(1-\alpha)/(m-1)$. We will call the former case a "sequential
transition" and the latter a "nonsequential transition." Also, let $p_{ik}^{(\ell)}$ be
the probability that the i-th processor will generate a k in the ℓ-th
position and no j occurs in the first ℓ-1 positions, i.e., the probability of
the event shown in Figure 14.

In the first position, assuming we know all the reference
probabilities $p_{ik}$, we have:

$$p_{i1}^{(1)} = p_{i1}, \quad p_{i2}^{(1)} = p_{i2}, \quad \ldots, \quad p_{im}^{(1)} = p_{im}.$$

The probability that no j shows up is, trivially, $1 - p_{ij}^{(1)}$. In the second
position, considering the two kinds of transition and the definition of $p_{ik}^{(2)}$, we
will get

$$p_{i1}^{(2)} = (1 - p_{ij}^{(1)} - p_{im}^{(1)}) \frac{1-\alpha}{m-1} + p_{im}^{(1)} \alpha,$$
$$p_{i2}^{(2)} = (1 - p_{ij}^{(1)} - p_{i1}^{(1)}) \frac{1-\alpha}{m-1} + p_{i1}^{(1)} \alpha,$$
$$\vdots$$
$$p_{i(j+1)}^{(2)} = (1 - p_{ij}^{(1)}) \frac{1-\alpha}{m-1},$$
$$\vdots$$
$$p_{im}^{(2)} = (1 - p_{ij}^{(1)} - p_{i(m-1)}^{(1)}) \frac{1-\alpha}{m-1} + p_{i(m-1)}^{(1)} \alpha.$$

The first term is the probability of nonsequential transition and the second
term is the probability of sequential transition. The probability of no j
in the first two positions is:

$$\sum_{k=1,\, k \ne j}^{m} p_{ik}^{(2)} = (1 - p_{ij}^{(1)}) \left[ 1 - \left( \frac{p_{i(j-1)}^{(1)}}{1 - p_{ij}^{(1)}}\, \alpha + \left( 1 - \frac{p_{i(j-1)}^{(1)}}{1 - p_{ij}^{(1)}} \right) \frac{1-\alpha}{m-1} \right) \right].$$

Actually, we can extend the above result to the s-th position (if we accept a
recursive solution) just by replacing 2 by s. We thus have:

$$p_{i1}^{(s)} = (1 - p_{ij}^{(s-1)} - p_{im}^{(s-1)}) \frac{1-\alpha}{m-1} + p_{im}^{(s-1)} \alpha,$$
$$p_{i2}^{(s)} = (1 - p_{ij}^{(s-1)} - p_{i1}^{(s-1)}) \frac{1-\alpha}{m-1} + p_{i1}^{(s-1)} \alpha,$$
$$\vdots$$
$$p_{i(j-1)}^{(s)} = (1 - p_{ij}^{(s-1)} - p_{i(j-2)}^{(s-1)}) \frac{1-\alpha}{m-1} + p_{i(j-2)}^{(s-1)} \alpha,$$
$$p_{i(j+1)}^{(s)} = (1 - p_{ij}^{(s-1)}) \frac{1-\alpha}{m-1},$$
$$\vdots$$
$$p_{im}^{(s)} = (1 - p_{ij}^{(s-1)} - p_{i(m-1)}^{(s-1)}) \frac{1-\alpha}{m-1} + p_{i(m-1)}^{(s-1)} \alpha.$$

All these can be solved recursively, and the probability that the i-th processor
does not reference the j-th module will be:

$$\sum_{k=1,\, k \ne j}^{m} p_{ik}^{(s)} = (1 - p_{ij}^{(s-1)}) \left[ 1 - \left( \frac{p_{i(j-1)}^{(s-1)}}{1 - p_{ij}^{(s-1)}}\, \alpha + \left( 1 - \frac{p_{i(j-1)}^{(s-1)}}{1 - p_{ij}^{(s-1)}} \right) \frac{1-\alpha}{m-1} \right) \right].$$

If we call this $q_{ij}^{(s)}$, then by the same argument, the bandwidth equation for
s > 1 will be

$$m - \sum_{j=1}^{m} \prod_{i=1}^{p} q_{ij}^{(s)}.$$

We can solve the above combinatorial problem by using another method,
namely, Markov chain analysis. Figure 46 shows the Markov chain for this
problem. Since we are only interested in finding out the probability that
no j will show up in s-1 transitions, we simply make state j transfer back
to itself with probability 1. Hence we have an absorbing Markov chain.

The transition matrix of this Markov chain is given by:

$$T = \begin{pmatrix}
\beta & \alpha & \beta & \cdots & \beta & \beta \\
\beta & \beta & \alpha & \cdots & \beta & \beta \\
\vdots & & & \ddots & & \vdots \\
0 & \cdots & 0 \;\; 1 \;\; 0 & \cdots & & 0 \\
\vdots & & & \ddots & & \vdots \\
\beta & \beta & \beta & \cdots & \beta & \alpha \\
\alpha & \beta & \beta & \cdots & \beta & \beta
\end{pmatrix},$$

where $\beta = (1-\alpha)/(m-1)$ is the nonsequential transition probability defined
above, and the row with the single entry 1 (in column j) is the row of the
absorbing state j.

If we let $\pi_k^{(\ell)}$ be the probability that the Markov chain will be in state
k after ℓ-1 transitions, then we can find $\pi^{(\ell+1)}$ from the following relation:

$$\pi^{(\ell+1)} = \pi^{(\ell)}\, T,$$

where $\pi^{(\ell)} = [\pi_1^{(\ell)}, \pi_2^{(\ell)}, \ldots, \pi_m^{(\ell)}]$ is the state probability vector.

If we consider the request generation to be the state transition in
our Markov chain, then the probability that no j will show up after generating
s requests will be equal to $1 - \pi_j^{(s)}$, given $\pi^{(1)} = [p_{i1}, p_{i2}, \ldots, p_{im}]$.
Actually, this method is exactly the same as the previous one, except we
are now using a matrix expression instead of a recursive expression.
Obviously, the latter method is neater and hence preferable.
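The Markov chain formulation translates directly into a few lines of code. The following is a minimal sketch under stated assumptions: the names `no_hit_probability` and `bandwidth` are hypothetical, modules are indexed 0 to m-1 so the sequential successor of module k is (k+1) mod m, every nonsequential target (including the current module) has probability (1-α)/(m-1), and m is at least 2.

```python
def no_hit_probability(p0, alpha, j, s):
    """Probability that module j is not referenced in s successive requests.

    p0 : initial reference probabilities of one processor, one per module
    """
    m = len(p0)
    beta = (1 - alpha) / (m - 1)            # nonsequential transition probability
    pi = list(p0)                           # state vector after the first request
    for _ in range(s - 1):                  # s requests mean s-1 transitions
        nxt = [0.0] * m
        for k, mass in enumerate(pi):
            if k == j:
                nxt[j] += mass              # state j is absorbing
                continue
            for h in range(m):
                nxt[h] += mass * (alpha if h == (k + 1) % m else beta)
        pi = nxt
    return 1.0 - pi[j]


def bandwidth(ref_probs, alpha, s):
    """Expected number of distinct modules referenced in one memory cycle."""
    m = len(ref_probs[0])
    total = 0.0
    for j in range(m):
        untouched = 1.0
        for p0 in ref_probs:                # one probability vector per processor
            untouched *= no_hit_probability(p0, alpha, j, s)
        total += untouched
    return m - total


uniform = [[0.25] * 4 for _ in range(4)]    # 4 processors, 4 modules
bw = bandwidth(uniform, alpha=0.25, s=1)    # 4 * (1 - 0.75**4) = 2.734375
```

With s = 1 and uniform reference probabilities the result reduces to the familiar m(1 - (1 - 1/m)^p). Setting α = 1/m makes every transition uniform, so for s = 2 each module is missed with probability (1 - 1/m)^2.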
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